**Motivations:**
- Document the blocking issue fix in the SSH connection errors documentation
- Keep documentation up to date with all fixes applied

**Root causes:**
- Documentation needed to reflect the blocking issue and its resolution

**Fixes:**
- Added information about the blocking issue at step 5
- Documented the fix for command substitution blocking
- Updated root cause analysis to include blocking causes

**Enhancements:**
- Documentation now complete with all fixes

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added blocking issue documentation
SSH Connection Errors During Deployment
Date: 2024-12-19
Author: 4NK Team
Problem Description
During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:
mux_client_request_session: read from master failed: Connection reset by peer
Failed to connect to new control master
mm_send_fd: sendmsg(2): Broken pipe
mux_client_request_session: send fds failed
These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.
Additionally, the script could hang indefinitely at step 5 due to command substitution blocking when SSH connections failed or timed out.
Impact
- Severity: Medium
- Scope: Deployment script reliability
- User Impact: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
- Frequency: Intermittent, occurring when SSH ControlMaster connection is interrupted
Root Cause
The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original ssh_exec function did not handle connection failures robustly:
- No connection validation: The function did not check if the ControlMaster socket was still valid before use
- No retry mechanism: Failed connections were not retried after cleanup
- No dead connection cleanup: Stale connections were not detected and cleaned up before reuse
- Silent failures: Connection errors in conditional checks could be misinterpreted as command failures
Root Cause Analysis
The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:
- Network interruptions can close the master connection
- The ControlMaster socket file can become stale if the connection dies
- The script did not detect or handle these cases, leading to cascading failures
- Command substitution $() with SSH commands can block indefinitely if the connection hangs
- The original logic used grep on command output, which required capturing all output and could hang
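The blocking behaviour can be reproduced without a server. In the minimal sketch below, `sleep 5` is a stand-in for an SSH command whose connection hangs, and GNU coreutils `timeout` bounds it; both the stand-in and the use of `timeout` are assumptions for illustration and are not taken from deploy.sh.

```shell
# $(...) only returns once the command has finished and closed its stdout,
# so a hung command stalls the assignment for as long as it hangs.
# Wrapping the command in timeout turns an open-ended hang into a bounded wait.
status=0
output=$(timeout 1 sleep 5) || status=$?

# GNU timeout exits with 124 when it had to stop the command.
if [ "$status" -eq 124 ]; then
    echo "command timed out, nothing was captured" >&2
fi
```

Without the `timeout` wrapper, the assignment would wait for the full duration of the stand-in command, which is the same mechanism that stalled step 5 when the SSH connection never returned.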
Corrections Applied
1. Enhanced SSH Connection Management
File: deploy.sh
Changes:
- Added cleanup_dead_ssh() function to properly clean up dead SSH connections
- Added check_ssh_connection() function to validate the ControlMaster connection before use
- Enhanced ssh_exec() function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
  - Retry mechanism (up to 3 attempts)
  - Additional SSH options for better connection stability:
    - ConnectTimeout=10: Fail fast if the connection cannot be established within 10 seconds
    - ServerAliveInterval=60: Send a keepalive probe after 60 seconds without traffic from the server
    - ServerAliveCountMax=3: Treat the connection as dead after 3 unanswered probes (roughly 3 minutes)
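The two helpers named above could look roughly like the following. This is a sketch under assumptions, not the code from deploy.sh: the function bodies and the default values of `SERVER` and `SSH_CONTROL_PATH` are hypothetical, and only the `ssh -O check` and `ssh -O exit` control commands are standard OpenSSH.

```shell
# Hypothetical defaults for illustration; deploy.sh defines its own values.
SERVER="${SERVER:-debian@example.invalid}"
SSH_CONTROL_PATH="${SSH_CONTROL_PATH:-/tmp/ssh_control_demo/master.sock}"

# Return 0 only if the socket exists and a master process answers on it.
check_ssh_connection() {
    [ -S "$SSH_CONTROL_PATH" ] || return 1
    ssh -O check -o ControlPath="$SSH_CONTROL_PATH" "$SERVER" >/dev/null 2>&1
}

# Ask any remaining master to exit, then remove the socket file it left behind.
cleanup_dead_ssh() {
    ssh -O exit -o ControlPath="$SSH_CONTROL_PATH" "$SERVER" >/dev/null 2>&1 || true
    rm -f "$SSH_CONTROL_PATH"
}
```

A caller would typically run `check_ssh_connection || cleanup_dead_ssh` before each remote command, so that a stale socket never reaches the multiplexing client.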
2. Improved Error Handling at Step 5
File: deploy.sh
Changes:
- Enhanced Git repository verification to properly handle SSH connection errors
- Added explicit error detection and recovery mechanism
- Added automatic retry after connection cleanup
- Better error messages to distinguish between connection errors and Git initialization needs
- Fixed blocking issue: Removed command substitution $() that could hang indefinitely
- Simplified logic to use direct exit code checking instead of parsing output
- Improved timeout handling to prevent the script from hanging
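Checking the exit code directly, as described in this list, might be sketched as follows. The function name `check_remote_git`, its return codes, the `test -d` probe, and the stand-in `ssh_exec` are all assumptions made for this example; the real step 5 logic in deploy.sh may differ.

```shell
# Hypothetical target; deploy.sh defines its own value.
SERVER="${SERVER:-debian@example.invalid}"

# Minimal stand-in for the script's ssh_exec, so the sketch is self-contained.
# BatchMode=yes stops ssh from waiting on a password prompt.
ssh_exec() {
    ssh -o ConnectTimeout=10 -o BatchMode=yes "$SERVER" "$@"
}

# check_remote_git DIR
# Returns 0 if DIR/.git exists, 1 if it does not, 2 on a connection error.
check_remote_git() {
    local dir="$1" status=0
    # No $(...) capture and no grep: output is discarded, stdin is detached,
    # and only the exit status decides the outcome.
    ssh_exec "test -d '${dir}/.git'" </dev/null >/dev/null 2>&1 || status=$?
    case "$status" in
        0)   return 0 ;;
        255) return 2 ;;  # ssh reports its own failures with exit status 255
        *)   return 1 ;;
    esac
}
```

Keeping the connection error as a separate return code lets the caller retry after cleanup instead of wrongly initializing a repository that already exists.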
Modifications
Files Modified
- deploy.sh:
  - Enhanced ssh_exec() function with retry logic and connection validation
  - Added cleanup_dead_ssh() and check_ssh_connection() helper functions
  - Improved error handling in Git repository verification step
Code Changes
Before:
ssh_exec() {
ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
}
After:
ssh_exec() {
# Validates connection, cleans up dead connections, retries on failure
# Includes connection stability options
}
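The "After" version is only summarized above. One way the described behaviour could be implemented is sketched here; it is an illustration built from the options and the 3-attempt limit listed in this document, not the actual body of ssh_exec in deploy.sh. The variable `SSH_MAX_ATTEMPTS`, the default values, and the fixed 1-second pause are assumptions.

```shell
# Hypothetical defaults for illustration; deploy.sh defines its own values.
SERVER="${SERVER:-debian@example.invalid}"
SSH_CONTROL_PATH="${SSH_CONTROL_PATH:-/tmp/ssh_control_demo/master.sock}"
SSH_MAX_ATTEMPTS="${SSH_MAX_ATTEMPTS:-3}"

ssh_exec() {
    local attempt=1 status=0
    while [ "$attempt" -le "$SSH_MAX_ATTEMPTS" ]; do
        # A socket file whose master no longer answers is stale: remove it so
        # that ControlMaster=auto can open a fresh master connection.
        if [ -S "$SSH_CONTROL_PATH" ] &&
           ! ssh -O check -o ControlPath="$SSH_CONTROL_PATH" "$SERVER" >/dev/null 2>&1; then
            rm -f "$SSH_CONTROL_PATH"
        fi

        status=0
        ssh -o ControlMaster=auto -o ControlPath="$SSH_CONTROL_PATH" \
            -o ControlPersist=300 -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            "$SERVER" "$@" || status=$?

        # ssh exits with 255 when it fails itself; any other status belongs
        # to the remote command and is returned unchanged.
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
        echo "ssh_exec: connection error (attempt ${attempt}/${SSH_MAX_ATTEMPTS})" >&2
        attempt=$((attempt + 1))
        sleep 1
    done
    return 255
}
```

Two limits of this sketch are worth noting: a remote command that itself exits with 255 is indistinguishable from a connection error and will be retried, and data piped into the function on stdin is consumed by the first attempt.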
Deployment Procedures
Automatic Deployment
The fix is automatically applied when using the deployment script:
./deploy.sh "commit message"
No manual intervention required. The script now handles SSH connection errors automatically.
Verification
After deployment, verify that SSH connections are stable:
- Check that deployment completes without SSH errors
- Monitor for connection errors in subsequent deployments
- Verify that retry mechanism works correctly
Analysis Procedures
Monitoring SSH Connection Issues
- Check deployment logs for SSH connection errors:
  # Review recent deployment output
- Verify SSH ControlMaster socket:
  # On the deployment machine
  ls -la /tmp/ssh_control_*/
- Test SSH connection manually (replace the * with the actual directory name, because a glob is not expanded inside quotes):
  ssh -O check -o ControlPath="/tmp/ssh_control_*/debian_92.243.27.35_22" debian@92.243.27.35
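The manual checks above can be combined into a small helper that walks every control socket and reports whether its master still answers. This is a sketch only: the function names and the optional root directory argument are hypothetical, and the `/tmp/ssh_control_*/` layout is taken from the commands in this section.

```shell
# list_control_sockets [ROOT]
# Print every socket file found under ROOT/ssh_control_*/ (ROOT defaults to /tmp).
list_control_sockets() {
    local root="${1:-/tmp}" path
    for path in "$root"/ssh_control_*/*; do
        # An unmatched glob stays literal; the -S test filters that case out too.
        if [ -S "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
    return 0
}

# check_control_sockets TARGET [ROOT]
# Run "ssh -O check" against each socket and label it alive or stale.
check_control_sockets() {
    local target="$1" root="${2:-/tmp}" path
    list_control_sockets "$root" | while IFS= read -r path; do
        if ssh -O check -o ControlPath="$path" "$target" </dev/null >/dev/null 2>&1; then
            echo "alive: $path"
        else
            echo "stale: $path"
        fi
    done
}
```

For example, `check_control_sockets debian@92.243.27.35` prints one line per socket, which makes a stale master visible before the next deployment starts.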
Debugging Steps
If SSH connection errors persist:
- Check network connectivity to the server
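- Verify that the SSH server configuration allows multiplexed sessions (ControlMaster is a client-side option; on the server, MaxSessions in sshd_config limits the sessions per connection)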
- Check for firewall or network issues
- Review SSH server logs on the remote machine
- Verify SSH key authentication is working
Logs to Review
- Deployment script output (stdout/stderr)
- SSH client logs (if verbose mode enabled)
- Remote SSH server logs: /var/log/auth.log or similar
Prevention
Best Practices
- Connection validation: Always validate SSH connections before use
- Retry logic: Implement retry mechanisms for network operations
- Cleanup: Properly clean up stale connections
- Error handling: Distinguish between different types of failures
- Monitoring: Monitor connection stability over time
Future Improvements
- Add connection health metrics
- Implement exponential backoff for retries
- Add connection pooling if needed
- Consider alternative connection methods if ControlMaster proves unreliable
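As an illustration of the exponential backoff item in this list, the sketch below wraps any command in a retry loop whose delay doubles after each failure. The function name `retry_with_backoff` and its arguments are hypothetical; nothing here exists in deploy.sh today.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# BASE_DELAY seconds after the first failure and doubling the delay each time.
retry_with_backoff() {
    local max_attempts="$1" delay="$2" attempt=1 status=0
    shift 2
    while :; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            # Give up and hand the last failure status back to the caller.
            return "$status"
        fi
        echo "attempt ${attempt} failed (exit ${status}); retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

A call such as `retry_with_backoff 4 2 ssh_exec "git status"` would wait 2, 4, and then 8 seconds between its four attempts, giving a briefly unreachable server more time to recover than a fixed pause would.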
Related Issues
None identified at this time.
References
- SSH ControlMaster documentation
- Deployment script: deploy.sh
- Related documentation: docs/deployment.md