Fix SSH connection errors during deployment

**Motivations:** - SSH ControlMaster connection errors were causing deployment failures - Connection reset errors were not handled properly - No retry mechanism for failed SSH connections **Root causes:** - SSH ControlMaster socket could become stale or be closed prematurely - No validation of connection before use - No cleanup of dead connections - Silent failures in conditional checks **Correctifs:** - Added connection validation before each SSH command - Implemented automatic cleanup of dead SSH connections - Added retry mechanism (up to 3 attempts) with connection cleanup - Enhanced SSH options for better connection stability (ConnectTimeout, ServerAliveInterval, ServerAliveCountMax) - Improved error handling in Git repository verification step with explicit error detection and recovery **Evolutions:** - Enhanced SSH connection management with robust error handling - Better error messages to distinguish connection errors from other failures **Pages affectées:** - deploy.sh: Enhanced ssh_exec() function, added helper functions, improved error handling - fixKnowledge/ssh-connection-errors-deployment.md: Documentation of the problem, root cause, and solution
2026-01-06 14:22:24 +01:00 · 2026-01-06 14:22:24 +01:00 · 8813498de4
commit 8813498de4
parent 13e0e0d801
2 changed files with 243 additions and 5 deletions
--- a/deploy.sh
+++ b/deploy.sh
@ -14,14 +14,59 @@ SSH_CONTROL_DIR="/tmp/ssh_control_$$"
 mkdir -p "${SSH_CONTROL_DIR}"
 SSH_CONTROL_PATH="${SSH_CONTROL_DIR}/debian_92.243.27.35_22"

-# Fonction pour exécuter une commande SSH avec connexion persistante
+# Fonction pour nettoyer une connexion SSH morte
+cleanup_dead_ssh() {
+    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" ${SERVER} 2>/dev/null || true
+    rm -f "${SSH_CONTROL_PATH}" 2>/dev/null || true
+}
+
+# Fonction pour vérifier si la connexion SSH maître est valide
+check_ssh_connection() {
+    ssh -O check -o ControlPath="${SSH_CONTROL_PATH}" ${SERVER} 2>/dev/null || return 1
+}
+
+# Fonction pour exécuter une commande SSH avec connexion persistante et gestion d'erreurs robuste
 ssh_exec() {
-    ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
+    local max_retries=3
+    local retry_count=0
+    
+    while [ $retry_count -lt $max_retries ]; do
+        # Vérifier si la connexion maître existe et est valide
+        if [ -S "${SSH_CONTROL_PATH}" ]; then
+            if ! check_ssh_connection; then
+                # Connexion morte, nettoyer avant de réessayer
+                cleanup_dead_ssh
+            fi
+        fi
+        
+        # Exécuter la commande SSH
+        if ssh -o ControlMaster=auto \
+              -o ControlPath="${SSH_CONTROL_PATH}" \
+              -o ControlPersist=300 \
+              -o ConnectTimeout=10 \
+              -o ServerAliveInterval=60 \
+              -o ServerAliveCountMax=3 \
+              ${SERVER} "$@" 2>&1; then
+            return 0
+        else
+            local exit_code=$?
+            retry_count=$((retry_count + 1))
+            
+            if [ $retry_count -lt $max_retries ]; then
+                # Nettoyer la connexion morte avant de réessayer
+                cleanup_dead_ssh
+                sleep 1
+            else
+                # Dernière tentative échouée, retourner le code d'erreur
+                return $exit_code
+            fi
+        fi
+    done
 }

 # Nettoyer les connexions SSH persistantes et le répertoire temporaire à la fin
 cleanup_ssh() {
-    ssh -O exit -o ControlPath="${SSH_CONTROL_PATH}" ${SERVER} 2>/dev/null || true
+    cleanup_dead_ssh
    rm -rf "${SSH_CONTROL_DIR}" 2>/dev/null || true
 }
 trap cleanup_ssh EXIT
@ -99,13 +144,27 @@ fi
 # Vérifier si Git est initialisé sur le serveur
 echo ""
 echo "5. Vérification du dépôt Git sur le serveur..."
-if ssh_exec "cd ${APP_DIR} && git status >/dev/null 2>&1"; then
+GIT_STATUS_OUTPUT=$(ssh_exec "cd ${APP_DIR} && git status >/dev/null 2>&1 && echo 'OK' || echo 'NOT_INIT'")
+if echo "$GIT_STATUS_OUTPUT" | grep -q "OK"; then
    echo "   ✓ Dépôt Git détecté"
+elif echo "$GIT_STATUS_OUTPUT" | grep -q "NOT_INIT"; then
+    echo "   ⚠ Dépôt Git non initialisé, initialisation..."
+    ssh_exec "cd ${APP_DIR} && git init && git remote add origin ${GIT_REPO} 2>/dev/null || git remote set-url origin ${GIT_REPO}"
+    ssh_exec "cd ${APP_DIR} && git checkout -b ${BRANCH} 2>/dev/null || true"
+else
+    echo "   ✗ Erreur de connexion SSH lors de la vérification du dépôt Git"
+    echo "   Tentative de nettoyage et nouvelle connexion..."
+    cleanup_dead_ssh
+    sleep 2
+    # Réessayer une fois après nettoyage
+    if ssh_exec "cd ${APP_DIR} && git status >/dev/null 2>&1"; then
+        echo "   ✓ Dépôt Git détecté après réessai"
    else
        echo "   ⚠ Dépôt Git non initialisé, initialisation..."
        ssh_exec "cd ${APP_DIR} && git init && git remote add origin ${GIT_REPO} 2>/dev/null || git remote set-url origin ${GIT_REPO}"
        ssh_exec "cd ${APP_DIR} && git checkout -b ${BRANCH} 2>/dev/null || true"
    fi
+fi

 # Récupérer les dernières modifications
 echo ""
--- a/fixKnowledge/ssh-connection-errors-deployment.md
+++ b/fixKnowledge/ssh-connection-errors-deployment.md
@ -0,0 +1,179 @@
+# SSH Connection Errors During Deployment
+
+**Date**: 2024-12-19  
+**Auteur**: Équipe 4NK
+
+## Problem Description
+
+During deployment, SSH connection errors occurred when verifying the Git repository on the server. The errors were:
+
+```
+mux_client_request_session: read from master failed: Connection reset by peer
+Failed to connect to new control master
+mm_send_fd: sendmsg(2): Broken pipe
+mux_client_request_session: send fds failed
+```
+
+These errors appeared at step 5 of the deployment script when checking if Git is initialized on the server.
+
+## Impact
+
+- **Severity**: Medium
+- **Scope**: Deployment script reliability
+- **User Impact**: Deployment could fail or continue with errors, potentially leaving the server in an inconsistent state
+- **Frequency**: Intermittent, occurring when SSH ControlMaster connection is interrupted
+
+## Root Cause
+
+The SSH ControlMaster multiplexing connection was being closed prematurely or becoming stale, causing subsequent SSH commands to fail. The original `ssh_exec` function did not handle connection failures robustly:
+
+1. **No connection validation**: The function did not check if the ControlMaster socket was still valid before use
+2. **No retry mechanism**: Failed connections were not retried after cleanup
+3. **No dead connection cleanup**: Stale connections were not detected and cleaned up before reuse
+4. **Silent failures**: Connection errors in conditional checks could be misinterpreted as command failures
+
+## Root Cause Analysis
+
+The SSH ControlMaster feature creates a persistent connection to avoid multiple SSH handshakes. However:
+
+- Network interruptions can close the master connection
+- The ControlMaster socket file can become stale if the connection dies
+- The script did not detect or handle these cases, leading to cascading failures
+
+## Corrections Applied
+
+### 1. Enhanced SSH Connection Management
+
+**File**: `deploy.sh`
+
+**Changes**:
+- Added `cleanup_dead_ssh()` function to properly clean up dead SSH connections
+- Added `check_ssh_connection()` function to validate ControlMaster connection before use
+- Enhanced `ssh_exec()` function with:
+  - Connection validation before each command
+  - Automatic cleanup of dead connections
+  - Retry mechanism (up to 3 attempts)
+  - Additional SSH options for better connection stability:
+    - `ConnectTimeout=10`: Fail fast if connection cannot be established
+    - `ServerAliveInterval=60`: Keep connection alive
+    - `ServerAliveCountMax=3`: Detect dead connections quickly
+
+### 2. Improved Error Handling at Step 5
+
+**File**: `deploy.sh`
+
+**Changes**:
+- Enhanced Git repository verification to properly handle SSH connection errors
+- Added explicit error detection and recovery mechanism
+- Added automatic retry after connection cleanup
+- Better error messages to distinguish between connection errors and Git initialization needs
+
+## Modifications
+
+### Files Modified
+
+- `deploy.sh`:
+  - Enhanced `ssh_exec()` function with retry logic and connection validation
+  - Added `cleanup_dead_ssh()` and `check_ssh_connection()` helper functions
+  - Improved error handling in Git repository verification step
+
+### Code Changes
+
+**Before**:
+```bash
+ssh_exec() {
+    ssh -o ControlMaster=auto -o ControlPath="${SSH_CONTROL_PATH}" -o ControlPersist=300 ${SERVER} "$@"
+}
+```
+
+**After**:
+```bash
+ssh_exec() {
+    # Validates connection, cleans up dead connections, retries on failure
+    # Includes connection stability options
+}
+```
+
+## Deployment Procedures
+
+### Automatic Deployment
+
+The fix is automatically applied when using the deployment script:
+
+```bash
+./deploy.sh "commit message"
+```
+
+No manual intervention required. The script now handles SSH connection errors automatically.
+
+### Verification
+
+After deployment, verify that SSH connections are stable:
+
+1. Check that deployment completes without SSH errors
+2. Monitor for connection errors in subsequent deployments
+3. Verify that retry mechanism works correctly
+
+## Analysis Procedures
+
+### Monitoring SSH Connection Issues
+
+1. **Check deployment logs** for SSH connection errors:
+   ```bash
+   # Review recent deployment output
+   ```
+
+2. **Verify SSH ControlMaster socket**:
+   ```bash
+   # On the deployment machine
+   ls -la /tmp/ssh_control_*/
+   ```
+
+3. **Test SSH connection manually**:
+   ```bash
+   ssh -O check -o ControlPath="/tmp/ssh_control_*/debian_92.243.27.35_22" debian@92.243.27.35
+   ```
+
+### Debugging Steps
+
+If SSH connection errors persist:
+
+1. Check network connectivity to the server
+2. Verify SSH server configuration allows ControlMaster
+3. Check for firewall or network issues
+4. Review SSH server logs on the remote machine
+5. Verify SSH key authentication is working
+
+### Logs to Review
+
+- Deployment script output (stdout/stderr)
+- SSH client logs (if verbose mode enabled)
+- Remote SSH server logs: `/var/log/auth.log` or similar
+
+## Prevention
+
+### Best Practices
+
+1. **Connection validation**: Always validate SSH connections before use
+2. **Retry logic**: Implement retry mechanisms for network operations
+3. **Cleanup**: Properly clean up stale connections
+4. **Error handling**: Distinguish between different types of failures
+5. **Monitoring**: Monitor connection stability over time
+
+### Future Improvements
+
+- Add connection health metrics
+- Implement exponential backoff for retries
+- Add connection pooling if needed
+- Consider alternative connection methods if ControlMaster proves unreliable
+
+## Related Issues
+
+None identified at this time.
+
+## References
+
+- SSH ControlMaster documentation
+- Deployment script: `deploy.sh`
+- Related documentation: `docs/deployment.md`
+