**Motivations:**
- Document the deep root cause of "ControlSocket already exists" errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution to aggressively clean up dead sockets

**Root causes:**
- Documentation needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated corrections section with latest improvements

**Enhancements:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections

# SSH Connection Errors During Deployment

**Date**: 2024-12-19
**Author**: 4NK Team

## Problem Description

During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:

```
mux_client_request_session: read from master failed: Connection reset by peer
Failed to connect to new control master
mm_send_fd: sendmsg(2): Broken pipe
mux_client_request_session: send fds failed
```

These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.

Additionally, the script could hang indefinitely at step 5 due to command substitution blocking when SSH connections failed or timed out.

## Impact

- **Severity**: Medium
- **Scope**: Deployment script reliability
- **User Impact**: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
- **Frequency**: Intermittent, occurring when SSH ControlMaster connection is interrupted

## Root Cause

The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original `ssh_exec` function did not handle connection failures robustly:

1. **No connection validation**: The function did not check if the ControlMaster socket was still valid before use
2. **No retry mechanism**: Failed connections were not retried after cleanup
3. **No dead connection cleanup**: Stale connections were not detected and cleaned up before reuse
4. **Silent failures**: Connection errors in conditional checks could be misinterpreted as command failures

## Root Cause Analysis

The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:

- Network interruptions can close the master connection
- The ControlMaster socket file can become stale if the connection dies
- The script did not detect or handle these cases, leading to cascading failures
- Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
- The original logic used `grep` on command output, which required capturing all output and could hang

### Deep Root Cause: ControlSocket Already Exists

The fundamental issue is that:

1. **Socket can become invalid between check and execution**: The socket file can exist on the filesystem, but the underlying connection can die between the validation check and the actual command execution.

2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
   - Disables multiplexing for that command
   - Creates a new connection without multiplexing
   - Leaves the dead socket file in place

3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.

4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.

The solution is to:
- Test actual connection usability (not just socket existence)
- Aggressively remove the entire socket directory when invalid
- Recreate the directory for fresh connections

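The three steps of the solution can be sketched roughly as follows. This is a minimal illustration, not the code from `deploy.sh`: the function names match the ones this document mentions, but their bodies, the `SSH_CONTROL_DIR` variable, the default values, and the 15-second limit are assumptions. It also assumes GNU `timeout` is available.

```bash
#!/usr/bin/env bash
# Sketch only: the defaults below are placeholders, not values from deploy.sh.
SERVER="${SERVER:-user@example.com}"
SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-/tmp/ssh_control_$$}"
SSH_CONTROL_PATH="${SSH_CONTROL_DIR}/master.sock"

# Return 0 only if a command can really run through the existing master.
check_ssh_connection() {
    # No socket file at all: nothing to reuse.
    [ -S "${SSH_CONTROL_PATH}" ] || return 1
    # "-O check" confirms that a master process answers on the socket.
    ssh -O check -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" 2>/dev/null || return 1
    # Running "true" confirms that a session can be opened; the time limit
    # guards against a master that accepts the request but never answers.
    timeout 15 ssh -o ControlPath="${SSH_CONTROL_PATH}" -o BatchMode=yes \
        "${SERVER}" true 2>/dev/null
}

# Ask the master to exit, then remove and recreate the whole socket directory.
cleanup_dead_ssh() {
    # Refuse to run "rm -rf" on an empty path.
    [ -n "${SSH_CONTROL_DIR}" ] || return 1
    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" 2>/dev/null || true
    rm -rf -- "${SSH_CONTROL_DIR}"
    mkdir -p -- "${SSH_CONTROL_DIR}"
    chmod 700 "${SSH_CONTROL_DIR}"
}
```

A caller would typically run `check_ssh_connection || cleanup_dead_ssh` before each remote command, so that a dead socket is gone before `ControlMaster=auto` sees it.
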
## Corrections Applied

### 1. Enhanced SSH Connection Management

**File**: `deploy.sh`

**Changes**:
- Added `cleanup_dead_ssh()` function to properly clean up dead SSH connections
- Added `check_ssh_connection()` function to validate ControlMaster connection before use
- Enhanced `ssh_exec()` function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
  - Additional SSH options for better connection stability:
    - `ConnectTimeout=10`: Fail fast if connection cannot be established
    - `ServerAliveInterval=60`: Send a keepalive probe after 60 seconds without data from the server, to keep the connection alive
    - `ServerAliveCountMax=3`: Treat the connection as dead after 3 unanswered probes (about 180 seconds with the interval above)

**Latest improvements (root cause fix)**:
- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
- `cleanup_dead_ssh()` now removes entire socket directory instead of just socket file
- This ensures socket is truly removed even if process is still holding it
- Directory is recreated after cleanup for fresh connections

### 2. Improved Error Handling at Step 5

**File**: `deploy.sh`

**Changes**:
- Enhanced Git repository verification to properly handle SSH connection errors
- Added explicit error detection and recovery mechanism
- Added automatic retry after connection cleanup
- Better error messages to distinguish between connection errors and Git initialization needs
- **Fixed blocking issue**: Removed command substitution `$()` that could hang indefinitely
- Simplified logic to use direct exit code checking instead of parsing output
- Improved timeout handling to prevent script from hanging

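To illustrate direct exit code checking with a time limit, here is a minimal sketch. The function names, the 30-second limit, and the remote path argument are hypothetical and not taken from `deploy.sh`; it assumes GNU `timeout` is installed. It relies on two documented behaviors: `ssh` exits with 255 when the connection itself fails, and `timeout` exits with 124 when the time limit expires.

```bash
#!/usr/bin/env bash
# Sketch only: names and limits are illustrative assumptions.

# Map a raw exit status to: 0 = repository present, 1 = absent,
# 2 = connection trouble (ssh failure, timeout, or anything unexpected).
git_check_status() {
    case "$1" in
        0) return 0 ;;
        1) return 1 ;;
        *) return 2 ;;   # includes 255 (ssh error) and 124 (timeout expired)
    esac
}

# Run the remote test with a hard time limit. No $() wraps the ssh call,
# so a stuck connection cannot block the script on output capture.
check_git_initialized() {
    server="$1"
    remote_dir="$2"
    rc=0
    timeout 30 ssh -n -o BatchMode=yes -o ConnectTimeout=10 \
        "${server}" "test -d '${remote_dir}/.git'" || rc=$?
    git_check_status "${rc}"
}
```

The caller can then branch on the return value with a `case` statement: initialize Git on status 1, and clean up the connection and retry on status 2.
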
## Modifications

### Files Modified

- `deploy.sh`:
  - Enhanced `ssh_exec()` function with retry logic and connection validation
  - Added `cleanup_dead_ssh()` and `check_ssh_connection()` helper functions
  - Improved error handling in Git repository verification step

### Code Changes

**Before**:
```bash
ssh_exec() {
    ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
}
```

**After**:
```bash
ssh_exec() {
    # Validates connection, cleans up dead connections, retries on failure
    # Includes connection stability options
}
```

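The "After" block only summarizes the new behavior. One way such a wrapper could be put together is sketched here. This is an assumption-laden illustration, not the actual `deploy.sh` code: `SSH_CONTROL_DIR`, `SSH_MAX_ATTEMPTS`, `ssh_reset_control_dir`, and the default values are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only; not the actual deploy.sh implementation.
SERVER="${SERVER:-user@example.com}"                       # placeholder host
SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-/tmp/ssh_control_$$}"  # assumed layout
SSH_CONTROL_PATH="${SSH_CONTROL_DIR}/master.sock"
SSH_MAX_ATTEMPTS="${SSH_MAX_ATTEMPTS:-2}"                  # assumed retry budget

# Remove and recreate the socket directory so no dead socket survives.
ssh_reset_control_dir() {
    [ -n "${SSH_CONTROL_DIR}" ] || return 1
    rm -rf -- "${SSH_CONTROL_DIR}"
    mkdir -p -- "${SSH_CONTROL_DIR}" && chmod 700 "${SSH_CONTROL_DIR}"
}

ssh_exec() {
    attempt=1
    rc=0
    while [ "${attempt}" -le "${SSH_MAX_ATTEMPTS}" ]; do
        mkdir -p -- "${SSH_CONTROL_DIR}"
        rc=0
        ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" \
            -o ControlPersist=300 -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            "${SERVER}" "$@" || rc=$?
        # 255 is ssh's own failure status; any other value comes from the
        # remote command and is returned to the caller unchanged.
        [ "${rc}" -eq 255 ] || return "${rc}"
        ssh_reset_control_dir
        attempt=$((attempt + 1))
    done
    return "${rc}"
}
```

Two caveats apply to this design. A remote command that itself exits with 255 is indistinguishable from a connection failure and will be retried. Retrying is also only safe for commands that can be run twice without side effects.
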
## Deployment Procedures

### Automatic Deployment

The fix is automatically applied when using the deployment script:

```bash
./deploy.sh "commit message"
```

No manual intervention required. The script now handles SSH connection errors automatically.

### Verification

After deployment, verify that SSH connections are stable:

1. Check that deployment completes without SSH errors
2. Monitor for connection errors in subsequent deployments
3. Verify that retry mechanism works correctly

## Analysis Procedures

### Monitoring SSH Connection Issues

1. **Check deployment logs** for SSH connection errors:
   ```bash
   # Review recent deployment output
   ```

2. **Verify SSH ControlMaster socket**:
   ```bash
   # On the deployment machine
   ls -la /tmp/ssh_control_*/
   ```

3. **Test SSH connection manually**:
   ```bash
   # Glob left unquoted so the shell expands it (assumes a single matching socket)
   ssh -O check -S /tmp/ssh_control_*/debian_92.243.27.35_22 debian@92.243.27.35
   ```

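When several socket directories exist, the manual check can be looped over all of them. The helper below is a hypothetical sketch (the function name and output format are not part of `deploy.sh`); it uses only `ssh -O check` and `-S`, both standard OpenSSH client options.

```bash
#!/usr/bin/env bash
# Sketch only: check_control_sockets is an illustrative helper.
# Usage: check_control_sockets <user@host> <socket-dir>...
check_control_sockets() {
    target="$1"
    shift
    found=0
    for dir in "$@"; do
        for sock in "${dir%/}"/*; do
            # Skip anything that is not a Unix socket (including an unmatched glob).
            [ -S "${sock}" ] || continue
            found=1
            if ssh -O check -S "${sock}" "${target}" 2>/dev/null; then
                echo "alive: ${sock}"
            else
                echo "dead: ${sock}"
            fi
        done
    done
    [ "${found}" -eq 1 ] || echo "no control sockets found"
}
```

For example, `check_control_sockets debian@92.243.27.35 /tmp/ssh_control_*/` would report each socket under the directories listed in step 2.
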
### Debugging Steps

If SSH connection errors persist:

1. Check network connectivity to the server
2. Verify that the SSH server configuration does not restrict multiplexed sessions (ControlMaster is a client-side option; on the server, `MaxSessions` in `sshd_config` limits the number of sessions per connection)
3. Check for firewall or network issues
4. Review SSH server logs on the remote machine
5. Verify SSH key authentication is working

### Logs to Review

- Deployment script output (stdout/stderr)
- SSH client logs (if verbose mode enabled)
- Remote SSH server logs: `/var/log/auth.log` or similar

## Prevention

### Best Practices

1. **Connection validation**: Always validate SSH connections before use
2. **Retry logic**: Implement retry mechanisms for network operations
3. **Cleanup**: Properly clean up stale connections
4. **Error handling**: Distinguish between different types of failures
5. **Monitoring**: Monitor connection stability over time

### Future Improvements

- Add connection health metrics
- Implement exponential backoff for retries
- Add connection pooling if needed
- Consider alternative connection methods if ControlMaster proves unreliable

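As an illustration of the exponential backoff item, a generic retry helper might look like the sketch below. It is not part of `deploy.sh`; the function name and argument order are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: retry_with_backoff is a hypothetical helper.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while true; do
        rc=0
        "$@" || rc=$?
        # Stop on success, or once the attempt budget is used up.
        if [ "${rc}" -eq 0 ] || [ "${attempt}" -ge "${max_attempts}" ]; then
            return "${rc}"
        fi
        sleep "${delay}"
        delay=$((delay * 2))        # double the wait before the next attempt
        attempt=$((attempt + 1))
    done
}
```

With `retry_with_backoff 4 2 some_command`, the waits between attempts would be 2, 4, and 8 seconds, and the final exit status of `some_command` is returned to the caller.
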
## Related Issues

None identified at this time.

## References

- SSH ControlMaster documentation
- Deployment script: `deploy.sh`
- Related documentation: `docs/deployment.md`