Update SSH connection errors documentation with root cause analysis

**Motivations:**
- Document the deep root cause of "ControlSocket already exists" errors
- Explain why `ControlMaster=auto` does not remove invalid sockets
- Document the solution: aggressively clean up dead sockets

**Root causes:**
- Documentation needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated corrections section with latest improvements

**Evolutions:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections
Nicolas Cantu 2026-01-06 14:37:19 +01:00
parent 01110cd322
commit f02b3938a1


@@ -44,6 +44,26 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
- Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
- The original logic used `grep` on command output, which required capturing all output and could hang
### Deep Root Cause: ControlSocket Already Exists

The fundamental issue is that:

1. **The socket can become invalid between check and execution**: The socket file can still exist on the filesystem while the underlying connection dies between the validation check and the actual command execution.
2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
   - Disables multiplexing for that command
   - Creates a new connection without multiplexing
   - Leaves the dead socket file in place
3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.
4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.

The solution is to:

- Test actual connection usability (not just socket existence)
- Aggressively remove the entire socket directory when invalid
- Recreate the directory for fresh connections
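
The remove-and-recreate step above can be sketched as follows. This is a minimal illustration, not the project's actual `cleanup_dead_ssh()`: the `SSH_CONTROL_DIR` variable and its default path are assumptions made for the example, and the guard against empty or root paths is an extra safety check that the text does not mention.

```shell
# Hypothetical location of the ControlMaster sockets (assumption for this sketch).
SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-$HOME/.ssh/deploy-control}"

cleanup_dead_ssh() {
    local dir="$SSH_CONTROL_DIR"
    # Refuse to run rm -rf on an empty or root path.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "cleanup_dead_ssh: refusing to remove '$dir'" >&2
        return 1
    fi
    # Remove the whole directory so no stale socket file survives.
    rm -rf -- "$dir"
    # Recreate it, private to the current user, for the next master connection.
    mkdir -p -- "$dir"
    chmod 700 "$dir"
}
```

Keeping the sockets in a dedicated directory is what makes this safe: nothing else lives there, so wiping it cannot remove keys or configuration.
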
## Corrections Applied
### 1. Enhanced SSH Connection Management
@@ -56,12 +76,17 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
- Enhanced `ssh_exec()` function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
  - Retry mechanism (up to 3 attempts)
- Additional SSH options for better connection stability:
  - `ConnectTimeout=10`: Fail fast (after 10 seconds) if the connection cannot be established
  - `ServerAliveInterval=60`: Send a keepalive probe every 60 seconds of inactivity
  - `ServerAliveCountMax=3`: Treat the connection as dead after 3 unanswered probes
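
The retry mechanism and the options above could be combined roughly as shown below. This is a sketch under stated assumptions, not the actual `ssh_exec()`: the `SSH_BIN`, `SSH_MAX_ATTEMPTS`, and `SSH_RETRY_DELAY` variables are hypothetical, and retrying only on exit status 255 (the status `ssh` returns for its own errors, as opposed to the remote command's status) is a design choice of this example. Connection validation and cleanup are left out to keep it short.

```shell
# Hypothetical knobs for this sketch (not taken from the original script).
SSH_BIN="${SSH_BIN:-ssh}"
SSH_MAX_ATTEMPTS="${SSH_MAX_ATTEMPTS:-3}"
SSH_RETRY_DELAY="${SSH_RETRY_DELAY:-2}"

ssh_exec() {
    local host="$1"; shift
    local attempt=1 status=0
    while [ "$attempt" -le "$SSH_MAX_ATTEMPTS" ]; do
        "$SSH_BIN" \
            -o ConnectTimeout=10 \
            -o ServerAliveInterval=60 \
            -o ServerAliveCountMax=3 \
            "$host" "$@" && return 0
        status=$?
        # 255 means ssh itself failed (connection problem); any other status
        # comes from the remote command and is returned without retrying.
        [ "$status" -ne 255 ] && return "$status"
        echo "ssh_exec: attempt $attempt/$SSH_MAX_ATTEMPTS failed" >&2
        attempt=$((attempt + 1))
        [ "$attempt" -le "$SSH_MAX_ATTEMPTS" ] && sleep "$SSH_RETRY_DELAY"
    done
    return "$status"
}
```

A call such as `ssh_exec deploy@example.invalid systemctl status myapp` (host and command are placeholders) would then make up to three connection attempts before giving up.
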
**Latest improvements (root cause fix)**:
- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
- `cleanup_dead_ssh()` now removes the entire socket directory instead of just the socket file
  - This ensures the socket is truly removed even if a process is still holding it
  - The directory is recreated after cleanup for fresh connections
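
A usability check in the spirit of `ssh ... true` might look like the sketch below. It is illustrative only and not the real `check_ssh_connection()`: the `SSH_BIN`, `SSH_CONTROL_PATH`, and `SSH_CHECK_TIMEOUT` variables are assumptions, as is the use of `BatchMode=yes` to prevent interactive prompts. Only the exit status is inspected, so no output has to be captured, and the check is bounded with `timeout` when that utility is installed.

```shell
# Hypothetical names and defaults for this sketch.
SSH_BIN="${SSH_BIN:-ssh}"
SSH_CONTROL_PATH="${SSH_CONTROL_PATH:-$HOME/.ssh/deploy-control/%r@%h:%p}"
SSH_CHECK_TIMEOUT="${SSH_CHECK_TIMEOUT:-15}"

check_ssh_connection() {
    local host="$1"
    # Build the command: run the no-op `true` remotely, using the same
    # ControlPath as the deployment so an existing master is exercised.
    set -- "$SSH_BIN" -o BatchMode=yes -o ControlPath="$SSH_CONTROL_PATH" "$host" true
    # Bound the check with `timeout` (GNU coreutils) when it is available,
    # so a hung connection cannot block the caller indefinitely.
    if command -v timeout >/dev/null 2>&1; then
        set -- timeout "$SSH_CHECK_TIMEOUT" "$@"
    fi
    # Succeeds only if the remote command actually ran and returned 0.
    "$@" >/dev/null 2>&1
}
```

A caller can then branch on the result, for example `check_ssh_connection "$host" || cleanup_dead_ssh`, where `$host` stands for whatever target the deployment uses.
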
### 2. Improved Error Handling at Step 5
**File**: `deploy.sh`