Update SSH connection errors documentation with root cause analysis

**Motivations:**
- Document the deep root cause of the "ControlSocket already exists" errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution: aggressively clean up dead sockets

**Root causes:**
- Documentation was needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added a deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated the corrections section with the latest improvements

**Enhancements:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections
2026-01-06 14:37:19 +01:00 · f02b3938a1
commit f02b3938a1
parent 01110cd322
1 changed file with 26 additions and 1 deletion
--- a/fixKnowledge/ssh-connection-errors-deployment.md
+++ b/fixKnowledge/ssh-connection-errors-deployment.md
@@ -44,6 +44,26 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
 - Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
 - The original logic used `grep` on command output, which required capturing all output and could hang

+### Deep Root Cause: ControlSocket Already Exists
+
+The fundamental issue is that:
+
+1. **Socket can become invalid between check and execution**: The socket file can exist on the filesystem, but the underlying connection can die between the validation check and the actual command execution.
+
+2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
+   - Disables multiplexing for that command
+   - Creates a new connection without multiplexing
+   - Leaves the dead socket file in place
+
+3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.
+
+4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.
+
+The solution is to:
+- Test actual connection usability (not just socket existence)
+- Aggressively remove the entire socket directory when invalid
+- Recreate the directory for fresh connections
+
 ## Corrections Applied

 ### 1. Enhanced SSH Connection Management
@@ -56,12 +76,17 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
 - Enhanced `ssh_exec()` function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
-  - Retry mechanism (up to 3 attempts)
  - Additional SSH options for better connection stability:
    - `ConnectTimeout=10`: Fail fast if connection cannot be established
    - `ServerAliveInterval=60`: Keep connection alive
    - `ServerAliveCountMax=3`: Detect dead connections quickly

+**Latest improvements (root cause fix)**:
+- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
+- `cleanup_dead_ssh()` now removes the entire socket directory instead of just the socket file
+- This ensures the socket is truly removed even if a process is still holding it
+- The directory is recreated after cleanup for fresh connections
+
 ### 2. Improved Error Handling at Step 5

 **File**: `deploy.sh`
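
The root-cause fix described in the diff above (test real usability, remove the whole socket directory, recreate it) could be sketched roughly as follows. This is not the code from `deploy.sh`, which the commit does not show: the function bodies, the `ssh_mux` helper, the `SSH_BIN` and `SSH_CONTROL_DIR` variables, the default directory, and the `ControlPersist=60` setting are all assumptions made for illustration. Only the function names and the three timeout and keepalive options come from the documented text.

```shell
#!/bin/sh
# Illustrative sketch only; not the actual deploy.sh implementation.

# Assumed location of the multiplexing sockets (hypothetical path).
SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-$HOME/.ssh/deploy-control}"
# Overridable ssh binary, which makes the functions easy to stub (assumption).
SSH_BIN="${SSH_BIN:-ssh}"

# Hypothetical helper: run ssh with the multiplexing and stability options.
ssh_mux() {
  "$SSH_BIN" \
    -o ControlMaster=auto \
    -o "ControlPath=$SSH_CONTROL_DIR/%r@%h:%p" \
    -o ControlPersist=60 \
    -o ConnectTimeout=10 \
    -o ServerAliveInterval=60 \
    -o ServerAliveCountMax=3 \
    "$@"
}

# Test actual usability by running a trivial remote command,
# rather than only asking the master with `ssh -O check`.
check_ssh_connection() {
  ssh_mux "$1" true </dev/null >/dev/null 2>&1
}

# Remove the entire socket directory, then recreate it empty.
cleanup_dead_ssh() {
  # Ask a surviving master to exit; failure is expected when it is already dead.
  ssh_mux -O exit "$1" >/dev/null 2>&1 || true
  # ${VAR:?} aborts instead of running `rm -rf` on an empty path.
  rm -rf -- "${SSH_CONTROL_DIR:?}"
  mkdir -p -- "$SSH_CONTROL_DIR"
  chmod 700 "$SSH_CONTROL_DIR"
}

# Validate the connection, clean up if it is dead, then run the command.
# Usage: ssh_exec user@host command [args...]
ssh_exec() {
  mkdir -p -- "$SSH_CONTROL_DIR"
  if ! check_ssh_connection "$1"; then
    cleanup_dead_ssh "$1"
  fi
  ssh_mux "$@"
}
```

A call such as `ssh_exec deploy@example.com uptime` (hypothetical host) would then validate the connection first and fall back to a fresh one when the old socket is unusable. Building `ControlPath` inside the helper at call time, instead of once at startup, keeps the path in step with `SSH_CONTROL_DIR` if that variable changes.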