From f02b3938a124dd9b49c3aba5133c2da51e650bcd Mon Sep 17 00:00:00 2001
From: Nicolas Cantu
Date: Tue, 6 Jan 2026 14:37:19 +0100
Subject: [PATCH] Update SSH connection errors documentation with root cause analysis
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

**Motivations:**
- Document the deep root cause of ControlSocket already exists errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution to aggressively clean up dead sockets

**Root causes:**
- Documentation needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated corrections section with latest improvements

**Enhancements:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections
---
 .../ssh-connection-errors-deployment.md       | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fixKnowledge/ssh-connection-errors-deployment.md b/fixKnowledge/ssh-connection-errors-deployment.md
index 7067924..ef53fd9 100644
--- a/fixKnowledge/ssh-connection-errors-deployment.md
+++ b/fixKnowledge/ssh-connection-errors-deployment.md
@@ -44,6 +44,26 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
 - Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
 - The original logic used `grep` on command output, which required capturing all output and could hang
 
+### Deep Root Cause: ControlSocket Already Exists
+
+The fundamental issue is that:
+
+1. **Socket can become invalid between check and execution**: The socket file can exist on the filesystem, but the underlying connection can die between the validation check and the actual command execution.
+
+2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
+   - Disables multiplexing for that command
+   - Creates a new connection without multiplexing
+   - Leaves the dead socket file in place
+
+3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.
+
+4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.
+
+The solution is to:
+- Test actual connection usability (not just socket existence)
+- Aggressively remove the entire socket directory when invalid
+- Recreate the directory for fresh connections
+
 ## Corrections Applied
 
 ### 1. Enhanced SSH Connection Management
@@ -56,12 +76,17 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
 - Enhanced `ssh_exec()` function with:
   - Connection validation before each command
   - Automatic cleanup of dead connections
-  - Retry mechanism (up to 3 attempts)
 - Additional SSH options for better connection stability:
   - `ConnectTimeout=10`: Fail fast if connection cannot be established
   - `ServerAliveInterval=60`: Keep connection alive
   - `ServerAliveCountMax=3`: Detect dead connections quickly
 
+**Latest improvements (root cause fix)**:
+- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
+- `cleanup_dead_ssh()` now removes entire socket directory instead of just socket file
+- This ensures socket is truly removed even if process is still holding it
+- Directory is recreated after cleanup for fresh connections
+
 ### 2. Improved Error Handling at Step 5
 
 **File**: `deploy.sh`