Nicolas Cantu f02b3938a1 Update SSH connection errors documentation with root cause analysis

**Motivations:**
- Document the deep root cause of ControlSocket already exists errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution to aggressively clean up dead sockets

**Root causes:**
- Documentation needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated corrections section with latest improvements

**Evolutions:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections

2026-01-06 14:37:19 +01:00

7.5 KiB


SSH Connection Errors During Deployment

Date: 2024-12-19
Author: Équipe 4NK

Problem Description

During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:

mux_client_request_session: read from master failed: Connection reset by peer
Failed to connect to new control master
mm_send_fd: sendmsg(2): Broken pipe
mux_client_request_session: send fds failed

These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.

Additionally, the script could hang indefinitely at step 5 due to command substitution blocking when SSH connections failed or timed out.

Impact

Severity: Medium
Scope: Deployment script reliability
User Impact: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
Frequency: Intermittent, occurring when SSH ControlMaster connection is interrupted

Root Cause

The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original ssh_exec function did not handle connection failures robustly:

No connection validation: The function did not check if the ControlMaster socket was still valid before use
No retry mechanism: Failed connections were not retried after cleanup
No dead connection cleanup: Stale connections were not detected and cleaned up before reuse
Silent failures: Connection errors in conditional checks could be misinterpreted as command failures

Root Cause Analysis

The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:

Network interruptions can close the master connection
The ControlMaster socket file can become stale if the connection dies
The script did not detect or handle these cases, leading to cascading failures
Command substitution $() with SSH commands can block indefinitely if the connection hangs
The original logic used grep on command output, which required capturing all output and could hang

Deep Root Cause: ControlSocket Already Exists

The fundamental issue is that:

Socket can become invalid between check and execution: The socket file can exist on the filesystem, but the underlying connection can die between the validation check and the actual command execution.
ControlMaster=auto behavior: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
- Disables multiplexing for that command
- Creates a new connection without multiplexing
- Leaves the dead socket file in place
Dead socket accumulation: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.
Insufficient cleanup: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.

The solution is to:

Test actual connection usability (not just socket existence)
Aggressively remove the entire socket directory when invalid
Recreate the directory for fresh connections
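The three steps above can be sketched as follows. This is a minimal illustration, not the actual deploy.sh code: the function names come from this document, but their bodies, the SSH_CONTROL_DIR variable, and the placeholder values are assumptions. Note that when a master is unusable, ssh may fall back to a direct connection, so the sketch runs ssh -O check first and only then opens a real session.

```shell
# Assumed placeholders; the real deploy.sh defines its own values.
SERVER="${SERVER:-debian@example.invalid}"
SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-/tmp/ssh_control_$$}"      # hypothetical name
SSH_CONTROL_PATH="${SSH_CONTROL_PATH:-${SSH_CONTROL_DIR}/master.sock}"

# Return 0 only if the socket exists, the master answers, and a trivial
# remote command completes.
check_ssh_connection() {
    [ -S "${SSH_CONTROL_PATH}" ] || return 1
    ssh -O check -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" \
        >/dev/null 2>&1 || return 1
    ssh -o ControlMaster=no -o ControlPath="${SSH_CONTROL_PATH}" \
        -o BatchMode=yes -o ConnectTimeout=10 \
        "${SERVER}" true </dev/null >/dev/null 2>&1
}

# Remove the whole socket directory and recreate it empty.
cleanup_dead_ssh() {
    # Ask a master that is still listening to exit; ignore failures.
    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" \
        >/dev/null 2>&1 || true
    # ":?" aborts instead of running rm -rf on an empty variable.
    rm -rf -- "${SSH_CONTROL_DIR:?}"
    mkdir -p "${SSH_CONTROL_DIR}" && chmod 700 "${SSH_CONTROL_DIR}"
}
```

A caller would typically run `check_ssh_connection || cleanup_dead_ssh` before each remote command.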

Corrections Applied

1. Enhanced SSH Connection Management

File: deploy.sh

Changes:

Added cleanup_dead_ssh() function to properly clean up dead SSH connections
Added check_ssh_connection() function to validate ControlMaster connection before use
Enhanced ssh_exec() function with:
- Connection validation before each command
- Automatic cleanup of dead connections
- Additional SSH options for better connection stability:
  - ConnectTimeout=10: Fail fast if connection cannot be established
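  - ServerAliveInterval=60: Send a keepalive probe after 60 seconds without data from the server
  - ServerAliveCountMax=3: Drop the connection after 3 unanswered probes (about 180 seconds with the interval above)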
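As an illustration of how these options combine on one command line, here is a small wrapper. The wrapper name ssh_with_options is hypothetical; the option values are the ones listed above plus those from the original ssh_exec().

```shell
# Hypothetical wrapper: applies the multiplexing and stability options
# to every ssh call and passes the remaining arguments through.
ssh_with_options() {
    ssh -o ControlMaster=auto \
        -o ControlPath="${SSH_CONTROL_PATH}" \
        -o ControlPersist=300 \
        -o ConnectTimeout=10 \
        -o ServerAliveInterval=60 \
        -o ServerAliveCountMax=3 \
        "$@"
}
```

For example, `ssh_with_options "${SERVER}" true` runs a trivial remote command with all options applied.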

Latest improvements (root cause fix):

check_ssh_connection() now tests actual connection usability with ssh ... true instead of only ssh -O check
cleanup_dead_ssh() now removes the entire socket directory instead of only the socket file
This ensures the socket is truly removed even if a process is still holding it
The directory is recreated after cleanup for fresh connections

2. Improved Error Handling at Step 5

File: deploy.sh

Changes:

Enhanced Git repository verification to properly handle SSH connection errors
Added explicit error detection and recovery mechanism
Added automatic retry after connection cleanup
Better error messages to distinguish between connection errors and Git initialization needs
Fixed blocking issue: Removed command substitution $() that could hang indefinitely
Simplified logic to use direct exit code checking instead of parsing output
Improved timeout handling to prevent script from hanging
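A minimal sketch of an exit-code-based check, assuming ssh_exec() is defined elsewhere in deploy.sh. The function name check_remote_git, the REMOTE_DIR variable, and the test -d probe are hypothetical. The sketch relies on ssh returning 255 for its own connection-level errors and otherwise returning the remote command's status.

```shell
# Hypothetical remote path; the real deploy.sh defines its own.
REMOTE_DIR="${REMOTE_DIR:-/srv/app}"

# Returns 0 if the repository exists, 1 if it still needs initializing,
# 2 if the SSH connection itself failed.
check_remote_git() {
    local status=0
    # No $(...) capture: only the exit status is used. Redirecting stdin
    # from /dev/null keeps ssh from waiting on the script's input.
    ssh_exec "test -d '${REMOTE_DIR}/.git'" </dev/null >/dev/null 2>&1 \
        || status=$?
    case "${status}" in
        0)   return 0 ;;
        255) return 2 ;;   # ssh's own error status
        *)   return 1 ;;
    esac
}
```

The caller can then branch on the result: initialize the repository on 1, clean up the connection and retry on 2.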

Modifications

Files Modified

deploy.sh:
- Enhanced ssh_exec() function with retry logic and connection validation
- Added cleanup_dead_ssh() and check_ssh_connection() helper functions
- Improved error handling in Git repository verification step

Code Changes

Before:

ssh_exec() {
    ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
}

After:

ssh_exec() {
    # Validates connection, cleans up dead connections, retries on failure
    # Includes connection stability options
}
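The "After" listing only summarizes the new behavior. One way it could be fleshed out is sketched below, assuming check_ssh_connection() and cleanup_dead_ssh() are the helpers named in this document and are defined elsewhere in deploy.sh. The retry count, the SSH_MAX_ATTEMPTS variable, and the use of exit status 255 to detect connection errors are assumptions, not the actual implementation.

```shell
ssh_exec() {
    local attempt=1 status=0
    # SSH_MAX_ATTEMPTS is a hypothetical knob; default to one retry.
    while [ "${attempt}" -le "${SSH_MAX_ATTEMPTS:-2}" ]; do
        # Validate the master before use; clean up if it is dead.
        check_ssh_connection || cleanup_dead_ssh
        status=0
        ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" \
            -o ControlPersist=300 -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            "${SERVER}" "$@" || status=$?
        # ssh returns 255 for its own errors; any other status belongs
        # to the remote command and is passed through without a retry.
        [ "${status}" -ne 255 ] && return "${status}"
        cleanup_dead_ssh
        attempt=$((attempt + 1))
    done
    return "${status}"
}
```

Retrying only on 255 avoids re-running a remote command that ran and failed on its own. A remote command that itself exits with 255 would still be retried, so this is only a heuristic.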

Deployment Procedures

Automatic Deployment

The fix is automatically applied when using the deployment script:

./deploy.sh "commit message"

No manual intervention required. The script now handles SSH connection errors automatically.

Verification

After deployment, verify that SSH connections are stable:

Check that deployment completes without SSH errors
Monitor for connection errors in subsequent deployments
Verify that retry mechanism works correctly

Analysis Procedures

Monitoring SSH Connection Issues

Check deployment logs for SSH connection errors:
```
# Review recent deployment output
```

Verify SSH ControlMaster socket:

# On the deployment machine
ls -la /tmp/ssh_control_*/

Test SSH connection manually:

# Leave the glob unquoted so the shell expands it (assumes a single matching socket)
ssh -O check -S /tmp/ssh_control_*/debian_92.243.27.35_22 debian@92.243.27.35

Debugging Steps

If SSH connection errors persist:

Check network connectivity to the server
Verify that the SSH server configuration allows multiplexed sessions (for example, MaxSessions in sshd_config)
Check for firewall or network issues
Review SSH server logs on the remote machine
Verify SSH key authentication is working

Logs to Review

Deployment script output (stdout/stderr)
SSH client logs (if verbose mode enabled)
Remote SSH server logs: /var/log/auth.log or similar

Prevention

Best Practices

Connection validation: Always validate SSH connections before use
Retry logic: Implement retry mechanisms for network operations
Cleanup: Properly clean up stale connections
Error handling: Distinguish between different types of failures
Monitoring: Monitor connection stability over time

Future Improvements

Add connection health metrics
Implement exponential backoff for retries
Add connection pooling if needed
Consider alternative connection methods if ControlMaster proves unreliable
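As a rough idea of what exponential backoff could look like, here is a generic helper. It is not part of deploy.sh; the function name and its arguments are hypothetical.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
# Runs the command until it succeeds, doubling the delay after each
# failure, and returns the last exit status if every attempt fails.
retry_with_backoff() {
    local max_attempts="$1" delay="$2" attempt=1 status=0
    shift 2
    while :; do
        status=0
        "$@" || status=$?
        [ "${status}" -eq 0 ] && return 0
        [ "${attempt}" -ge "${max_attempts}" ] && return "${status}"
        sleep "${delay}"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, `retry_with_backoff 4 2 ssh_exec true` would wait 2, 4, and 8 seconds between its four attempts.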


References

SSH ControlMaster documentation
Deployment script: deploy.sh
Related documentation: docs/deployment.md
