**Motivations:**
- SSH ControlMaster connection errors were causing deployment failures
- Connection reset errors were not handled properly
- No retry mechanism for failed SSH connections

**Root causes:**
- SSH ControlMaster socket could become stale or be closed prematurely
- No validation of connection before use
- No cleanup of dead connections
- Silent failures in conditional checks

**Fixes:**
- Added connection validation before each SSH command
- Implemented automatic cleanup of dead SSH connections
- Added retry mechanism (up to 3 attempts) with connection cleanup
- Enhanced SSH options for better connection stability (ConnectTimeout, ServerAliveInterval, ServerAliveCountMax)
- Improved error handling in Git repository verification step with explicit error detection and recovery

**Enhancements:**
- Enhanced SSH connection management with robust error handling
- Better error messages to distinguish connection errors from other failures

**Affected files:**
- deploy.sh: Enhanced ssh_exec() function, added helper functions, improved error handling
- fixKnowledge/ssh-connection-errors-deployment.md: Documentation of the problem, root cause, and solution
SSH Connection Errors During Deployment
Date: 2024-12-19
Author: 4NK Team
Problem Description
During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:
mux_client_request_session: read from master failed: Connection reset by peer
Failed to connect to new control master
mm_send_fd: sendmsg(2): Broken pipe
mux_client_request_session: send fds failed
These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.
Impact
- Severity: Medium
- Scope: Deployment script reliability
- User Impact: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
- Frequency: Intermittent, occurring whenever the SSH ControlMaster connection is interrupted
Root Cause
The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original ssh_exec function did not handle connection failures robustly:
- No connection validation: The function did not check if the ControlMaster socket was still valid before use
- No retry mechanism: Failed connections were not retried after cleanup
- No dead connection cleanup: Stale connections were not detected and cleaned up before reuse
- Silent failures: Connection errors in conditional checks could be misinterpreted as command failures
Root Cause Analysis
The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:
- Network interruptions can close the master connection
- The ControlMaster socket file can become stale if the connection dies
- The script did not detect or handle these cases, leading to cascading failures
Corrections Applied
1. Enhanced SSH Connection Management
File: deploy.sh
Changes:
- Added cleanup_dead_ssh() function to properly clean up dead SSH connections
- Added check_ssh_connection() function to validate the ControlMaster connection before use
- Enhanced ssh_exec() function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
  - Retry mechanism (up to 3 attempts)
  - Additional SSH options for better connection stability:
    - ConnectTimeout=10: Fail fast if the connection cannot be established within 10 seconds
    - ServerAliveInterval=60: Send a keepalive probe after 60 seconds without data from the server
    - ServerAliveCountMax=3: Treat the connection as dead after 3 unanswered probes (about 3 minutes with the interval above)
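The bodies of the two helpers are not reproduced in this document. As a minimal sketch of how they might be written, assuming SERVER and SSH_CONTROL_PATH are set by deploy.sh as in the original ssh_exec(), the standard ssh control commands -O check and -O exit are enough:

```shell
# Sketch only: the function names come from this document, the bodies are
# an assumption. SERVER and SSH_CONTROL_PATH are expected to be set by deploy.sh.

# Return 0 if the ControlMaster behind the socket still answers.
check_ssh_connection() {
    # No socket file means there is no master to reuse yet.
    [ -S "${SSH_CONTROL_PATH}" ] || return 1
    # "-O check" asks the master process whether it is still running.
    ssh -O check -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" >/dev/null 2>&1
}

# Ask a leftover master to exit, then remove any socket file it left behind.
cleanup_dead_ssh() {
    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" "${SERVER}" >/dev/null 2>&1 || true
    rm -f "${SSH_CONTROL_PATH}"
}
```

Removing the socket file after -O exit covers the case where the master process is already gone and only the stale file remains.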
2. Improved Error Handling at Step 5
File: deploy.sh
Changes:
- Enhanced Git repository verification to properly handle SSH connection errors
- Added explicit error detection and recovery mechanism
- Added automatic retry after connection cleanup
- Better error messages to distinguish between connection errors and Git initialization needs
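One way to tell the two cases apart relies on the ssh exit status: ssh returns 255 when ssh itself fails and otherwise passes through the exit status of the remote command. The sketch below is illustrative only; verify_git_repository and REMOTE_DIR are hypothetical names, and ssh_exec is assumed to be the wrapper from deploy.sh.

```shell
# Sketch only: verify_git_repository and REMOTE_DIR are hypothetical names.
# ssh_exec is assumed to be defined by deploy.sh.

# Prints one of: initialized, not-initialized, connection-error.
verify_git_repository() {
    local rc=0
    ssh_exec "test -d '${REMOTE_DIR}/.git'" || rc=$?
    case "${rc}" in
        0)   echo "initialized" ;;
        # ssh reserves 255 for its own errors.
        255) echo "connection-error" ;;
        # Any other status is the exit status of the remote "test" command.
        *)   echo "not-initialized" ;;
    esac
}
```

A caller can then branch on the printed value, running the Git initialization only for not-initialized and aborting with a clear message for connection-error. A remote command that itself exits with 255 would be misread as a connection error, so this check suits simple commands such as test.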
Modifications
Files Modified
deploy.sh:
- Enhanced ssh_exec() function with retry logic and connection validation
- Added cleanup_dead_ssh() and check_ssh_connection() helper functions
- Improved error handling in Git repository verification step
Code Changes
Before:
ssh_exec() {
ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
}
After:
ssh_exec() {
# Validates connection, cleans up dead connections, retries on failure
# Includes connection stability options
}
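The After listing only summarizes the new behavior in comments. A minimal sketch of a function with that behavior could look like the following. It is not the actual deploy.sh code: it assumes check_ssh_connection and cleanup_dead_ssh are already defined as described under Corrections Applied, and SSH_MAX_ATTEMPTS and SSH_RETRY_DELAY are hypothetical tuning variables.

```shell
# Sketch only, not the actual deploy.sh implementation.
# Assumes SERVER, SSH_CONTROL_PATH, check_ssh_connection and cleanup_dead_ssh
# are provided by deploy.sh. SSH_MAX_ATTEMPTS and SSH_RETRY_DELAY are
# hypothetical knobs introduced for this example.
ssh_exec() {
    local attempt=1 rc
    while [ "${attempt}" -le "${SSH_MAX_ATTEMPTS:-3}" ]; do
        # Drop a stale master before ssh tries to reuse its socket.
        check_ssh_connection || cleanup_dead_ssh

        rc=0
        ssh -o ControlMaster=auto \
            -o ControlPath="${SSH_CONTROL_PATH}" \
            -o ControlPersist=300 \
            -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 \
            -o ServerAliveCountMax=3 \
            "${SERVER}" "$@" || rc=$?

        # Any status other than 255 is the remote command's own result.
        if [ "${rc}" -ne 255 ]; then
            return "${rc}"
        fi

        echo "SSH connection error (attempt ${attempt}), cleaning up before retry" >&2
        cleanup_dead_ssh
        sleep "${SSH_RETRY_DELAY:-1}"
        attempt=$((attempt + 1))
    done
    return 255
}
```

Only status 255 triggers a retry, so a remote command that fails on its own is reported once and not repeated. If the connection drops while a command is running, a retry re-runs that command, which is safe only for idempotent commands.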
Deployment Procedures
Automatic Deployment
The fix is automatically applied when using the deployment script:
./deploy.sh "commit message"
No manual intervention required. The script now handles SSH connection errors automatically.
Verification
After deployment, verify that SSH connections are stable:
- Check that deployment completes without SSH errors
- Monitor for connection errors in subsequent deployments
- Verify that retry mechanism works correctly
Analysis Procedures
Monitoring SSH Connection Issues
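- Check deployment logs for SSH connection errors:
    # Review recent deployment output
- Verify SSH ControlMaster socket:
    # On the deployment machine
    ls -la /tmp/ssh_control_*/
- Test SSH connection manually, replacing the * with the actual directory name listed by the previous command (the shell does not expand a glob inside quotes, and ssh does not expand it either):
    ssh -O check -o ControlPath="/tmp/ssh_control_*/debian_92.243.27.35_22" debian@92.243.27.35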
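Because the control directory name varies, a small loop can run the same check against every socket it finds. This is a sketch under assumptions: list_control_sockets is a hypothetical helper, the default prefix mirrors the /tmp/ssh_control_ paths shown in this section, and SERVER is assumed to hold the user@host target used by deploy.sh.

```shell
# Sketch only: list_control_sockets is a hypothetical helper.
# SERVER is assumed to hold the deployment target (user@host).

# Report every control socket under <prefix>*/ as alive or stale.
list_control_sockets() {
    local prefix="${1:-/tmp/ssh_control_}" sock
    for sock in "${prefix}"*/*; do
        # An unmatched glob stays literal; skip anything that is not a socket.
        [ -S "${sock}" ] || continue
        if ssh -O check -o ControlPath="${sock}" "${SERVER}" >/dev/null 2>&1; then
            echo "alive: ${sock}"
        else
            echo "stale: ${sock}"
        fi
    done
}
```

Calling list_control_sockets with no argument scans the default location; a stale entry points to a socket file whose master process no longer answers.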
Debugging Steps
If SSH connection errors persist:
- Check network connectivity to the server
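- Verify that the SSH server configuration allows multiplexed sessions (ControlMaster is a client-side option; on the server, MaxSessions in sshd_config limits the sessions per connection)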
- Check for firewall or network issues
- Review SSH server logs on the remote machine
- Verify SSH key authentication is working
Logs to Review
- Deployment script output (stdout/stderr)
- SSH client logs (if verbose mode enabled)
- Remote SSH server logs: /var/log/auth.log or similar
Prevention
Best Practices
- Connection validation: Always validate SSH connections before use
- Retry logic: Implement retry mechanisms for network operations
- Cleanup: Properly clean up stale connections
- Error handling: Distinguish between different types of failures
- Monitoring: Monitor connection stability over time
Future Improvements
- Add connection health metrics
- Implement exponential backoff for retries
- Add connection pooling if needed
- Consider alternative connection methods if ControlMaster proves unreliable
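As an illustration of the exponential backoff item, the delay between retries could double on each attempt up to a cap. The helper below is a hypothetical sketch and is not part of deploy.sh; backoff_delay, its arguments, and the default values are all assumptions.

```shell
# Sketch only: backoff_delay is a hypothetical helper, not part of deploy.sh.
# Usage: backoff_delay <attempt> [base_seconds] [cap_seconds]
# Prints the delay before retry number <attempt> (1-based): the base delay
# doubled once per previous attempt, never exceeding the cap.
backoff_delay() {
    local attempt="$1" base="${2:-1}" cap="${3:-30}"
    local delay="${base}" i=1
    while [ "${i}" -lt "${attempt}" ] && [ "${delay}" -lt "${cap}" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "${delay}" -gt "${cap}" ]; then
        delay="${cap}"
    fi
    echo "${delay}"
}
```

Inside a retry loop, a fixed pause could then be replaced by sleep "$(backoff_delay "${attempt}" 1 30)", where attempt is the loop counter.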
Related Issues
None identified at this time.
References
- SSH ControlMaster documentation
- Deployment script: deploy.sh
- Related documentation: docs/deployment.md