SSH Connection Errors During Deployment

Date: 2024-12-19 Author: 4NK Team

Problem Description

During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:

mux_client_request_session: read from master failed: Connection reset by peer
Failed to connect to new control master
mm_send_fd: sendmsg(2): Broken pipe
mux_client_request_session: send fds failed

These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.

Impact

  • Severity: Medium
  • Scope: Deployment script reliability
  • User Impact: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
  • Frequency: Intermittent, occurring when the SSH ControlMaster connection is interrupted

Root Cause

The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original ssh_exec function did not handle connection failures robustly:

  1. No connection validation: The function did not check if the ControlMaster socket was still valid before use
  2. No retry mechanism: Failed connections were not retried after cleanup
  3. No dead connection cleanup: Stale connections were not detected and cleaned up before reuse
  4. Silent failures: Connection errors in conditional checks could be misinterpreted as command failures

Root Cause Analysis

The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:

  • Network interruptions can close the master connection
  • The ControlMaster socket file can become stale if the connection dies
  • The script did not detect or handle these cases, leading to cascading failures

Corrections Applied

1. Enhanced SSH Connection Management

File: deploy.sh

Changes:

  • Added cleanup_dead_ssh() function to properly clean up dead SSH connections
  • Added check_ssh_connection() function to validate ControlMaster connection before use
  • Enhanced ssh_exec() function with:
    • Connection validation before each command
    • Automatic cleanup of dead connections
    • Retry mechanism (up to 3 attempts)
    • Additional SSH options for better connection stability:
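      • ConnectTimeout=10: Give up after 10 seconds if the connection cannot be established
      • ServerAliveInterval=60: Send a keepalive probe after 60 seconds without data from the server
      • ServerAliveCountMax=3: Close the connection after 3 unanswered probes, so a dead connection is detected after about 3 minutes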
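
The two helpers named above could look roughly like the following. This is a minimal sketch, not the code from deploy.sh: the function bodies are assumptions, and only the names check_ssh_connection(), cleanup_dead_ssh(), SERVER and SSH_CONTROL_PATH come from this document. It relies on the standard `ssh -O check` and `ssh -O exit` multiplexing commands.

```shell
#!/usr/bin/env bash
# Sketch only: these bodies are assumptions, not the code from deploy.sh.
# SERVER and SSH_CONTROL_PATH are the variables used by the original ssh_exec().

# Return 0 only if a live ControlMaster is listening on the control socket.
check_ssh_connection() {
    # A missing path, or a path that is not a socket, cannot be a live master.
    [[ -S "${SSH_CONTROL_PATH}" ]] || return 1
    # "ssh -O check" asks the master process whether it is still running.
    ssh -O check -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" >/dev/null 2>&1
}

# Ask any remaining master to exit, then remove the leftover socket file.
cleanup_dead_ssh() {
    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" >/dev/null 2>&1 || true
    rm -f -- "${SSH_CONTROL_PATH}"
}
```

Checking for a socket file first keeps the validation cheap and avoids calling ssh at all when no master has been started yet.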

2. Improved Error Handling at Step 5

File: deploy.sh

Changes:

  • Enhanced Git repository verification to properly handle SSH connection errors
  • Added explicit error detection and recovery mechanism
  • Added automatic retry after connection cleanup
  • Better error messages to distinguish between connection errors and Git initialization needs
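
One way to tell a connection error apart from a missing repository is the exit status: ssh returns the remote command's status, or 255 when ssh itself fails. The sketch below illustrates that idea only. The function remote_git_state() and the example path are hypothetical, and it assumes ssh_exec() from deploy.sh is already defined.

```shell
#!/usr/bin/env bash
# Sketch only: remote_git_state() is a hypothetical helper, not part of deploy.sh.
# It assumes ssh_exec() is defined and passes through ssh's exit status.

remote_git_state() {
    local remote_dir="$1" status=0
    # Note: ssh joins its arguments into one remote command line, so this
    # sketch assumes remote_dir contains no spaces or shell metacharacters.
    ssh_exec test -d "${remote_dir}/.git" || status=$?
    case "${status}" in
        0)   echo "initialized" ;;       # remote test succeeded
        255) echo "connection-error" ;;  # ssh itself failed
        *)   echo "not-initialized" ;;   # remote test ran and reported failure
    esac
}
```

A caller could then branch on the result, for example `state="$(remote_git_state /path/to/app)"`, and run `git init` only for `not-initialized` instead of also doing so after a dropped connection.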

Modifications

Files Modified

  • deploy.sh:
    • Enhanced ssh_exec() function with retry logic and connection validation
    • Added cleanup_dead_ssh() and check_ssh_connection() helper functions
    • Improved error handling in Git repository verification step

Code Changes

Before:

ssh_exec() {
    ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
}

After:

ssh_exec() {
    # Validates connection, cleans up dead connections, retries on failure
    # Includes connection stability options
}
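
The "After" summary can be made concrete with a sketch such as the one below. It is an assumed shape for the enhanced function, not the code from deploy.sh: SSH_MAX_ATTEMPTS and SSH_RETRY_DELAY are hypothetical tunables, and check_ssh_connection() and cleanup_dead_ssh() are expected to be defined already, as the change list describes.

```shell
#!/usr/bin/env bash
# Sketch only: an assumed shape for the enhanced ssh_exec(), not the real code.
# SSH_MAX_ATTEMPTS and SSH_RETRY_DELAY are hypothetical; check_ssh_connection()
# and cleanup_dead_ssh() must be defined before this function is called.

ssh_exec() {
    local max_attempts="${SSH_MAX_ATTEMPTS:-3}" delay="${SSH_RETRY_DELAY:-2}"
    local attempt status
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        # Remove a stale master so ControlMaster=auto can create a fresh one.
        check_ssh_connection || cleanup_dead_ssh
        status=0
        ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" \
            -o ControlPersist=300 -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            "${SERVER}" "$@" || status=$?
        # ssh reserves 255 for its own failures. Any other status comes from
        # the remote command and is returned as-is, without retrying.
        (( status == 255 )) || return "${status}"
        echo "SSH connection failed (attempt ${attempt}/${max_attempts})" >&2
        cleanup_dead_ssh
        if (( attempt < max_attempts )); then
            sleep "${delay}"
        fi
    done
    return 255
}
```

Retrying only on status 255 avoids re-running a remote command that actually executed and failed, which matters for steps that are not idempotent.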

Deployment Procedures

Automatic Deployment

The fix is automatically applied when using the deployment script:

./deploy.sh "commit message"

No manual intervention required. The script now handles SSH connection errors automatically.

Verification

After deployment, verify that SSH connections are stable:

  1. Check that deployment completes without SSH errors
  2. Monitor for connection errors in subsequent deployments
  3. Verify that the retry mechanism recovers when the ControlMaster connection is lost during a run

Analysis Procedures

Monitoring SSH Connection Issues

  1. Check deployment logs for SSH connection errors:

    # Review recent deployment output
    
  2. Verify SSH ControlMaster socket:

    # On the deployment machine
    ls -la /tmp/ssh_control_*/
    
  3. Test SSH connection manually:

    # Leave the glob unquoted so the shell expands it (assumes exactly one matching socket)
    ssh -S /tmp/ssh_control_*/debian_92.243.27.35_22 -O check debian@92.243.27.35
    
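
For step 1, the log review can be partly automated. The sketch below assumes the deployment output was saved to a file, which deploy.sh is not known to do by itself. The function count_mux_errors() and the file name deploy.log are hypothetical, and the patterns are taken from the error messages quoted in the problem description.

```shell
#!/usr/bin/env bash
# Sketch only: count_mux_errors() and the log file are hypothetical.
# The patterns match the multiplexing errors quoted in this document.

count_mux_errors() {
    local log_file="$1"
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are none, so "|| true" keeps the function's status at 0.
    grep -c -E 'mux_client_request_session|Failed to connect to new control master|mm_send_fd' \
        "${log_file}" || true
}
```

One possible use is to capture a run with `./deploy.sh "commit message" 2>&1 | tee deploy.log` and then call `count_mux_errors deploy.log`.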

Debugging Steps

If SSH connection errors persist:

  1. Check network connectivity to the server
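  2. Verify that the SSH server allows multiplexed sessions (ControlMaster is a client-side option; on the server, MaxSessions in sshd_config limits the sessions per connection)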
  3. Check for firewall or network issues
  4. Review SSH server logs on the remote machine
  5. Verify SSH key authentication is working

Logs to Review

  • Deployment script output (stdout/stderr)
  • SSH client logs (if verbose mode is enabled, for example with ssh -v)
  • Remote SSH server logs: /var/log/auth.log or similar

Prevention

Best Practices

  1. Connection validation: Always validate SSH connections before use
  2. Retry logic: Implement retry mechanisms for network operations
  3. Cleanup: Properly clean up stale connections
  4. Error handling: Distinguish between different types of failures
  5. Monitoring: Monitor connection stability over time

Future Improvements

  • Add connection health metrics
  • Implement exponential backoff for retries
  • Add connection pooling if needed
  • Consider alternative connection methods if ControlMaster proves unreliable
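
As an illustration of the exponential backoff item, the delay between retries could be computed as follows. This is a sketch of the general technique, not planned code: backoff_delay() and its default base and cap values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: backoff_delay() and its defaults are hypothetical.

# Print the delay in seconds before retry number "attempt" (starting at 1):
# base * 2^(attempt - 1), limited to "cap".
backoff_delay() {
    local attempt="$1" base="${2:-1}" cap="${3:-30}"
    local delay=$(( base * (1 << (attempt - 1)) ))
    if (( delay > cap )); then
        delay="${cap}"
    fi
    echo "${delay}"
}
```

A retry loop could then call `sleep "$(backoff_delay "${attempt}")"` instead of sleeping for a fixed interval.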

References

  • SSH ControlMaster documentation
  • Deployment script: deploy.sh
  • Related documentation: docs/deployment.md