Update SSH connection errors documentation with root cause analysis
**Motivations:**
- Document the deep root cause of "ControlSocket already exists" errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution of aggressively cleaning up dead sockets

**Root causes:**
- The documentation did not explain the root cause of socket invalidation
- The documentation did not explain why directory removal is necessary for proper cleanup

**Fixes:**
- Added a deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated the corrections section with the latest improvements

**Enhancements:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections
parent 01110cd322
commit f02b3938a1
@@ -44,6 +44,26 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
- Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
- The original logic used `grep` on command output, which required capturing all output and could hang
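One way to avoid both problems is to bound the command with a deadline and branch on its exit status rather than on captured output. The sketch below is illustrative only: `run_with_deadline` is a hypothetical helper name, and it assumes the GNU coreutils `timeout` command is available on the machine running the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from deploy.sh): run a command under a hard
# time limit and report only its exit status, without capturing any output.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit is hit.
run_with_deadline() {
    local seconds="$1"
    shift
    timeout "$seconds" "$@" >/dev/null 2>&1
}

# Illustrative usage (host and remote command are placeholders):
#   if run_with_deadline 15 ssh "$host" true; then
#       echo "remote host reachable"
#   else
#       echo "remote check failed or timed out" >&2
#   fi
```

Because nothing is captured with `$()`, a hung connection costs at most the chosen deadline instead of blocking the script.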
### Deep Root Cause: ControlSocket Already Exists

The fundamental issue is that:

1. **The socket can become invalid between check and execution**: The socket file can still exist on the filesystem while the underlying connection dies between the validation check and the actual command execution.

2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
   - Disables multiplexing for that command
   - Creates a new connection without multiplexing
   - Leaves the dead socket file in place

3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.

4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.

The solution is to:
- Test actual connection usability (not just socket existence)
- Aggressively remove the entire socket directory when the socket is invalid
- Recreate the directory for fresh connections
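The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual contents of `deploy.sh`: the socket directory location, the `SSH_CONTROL_DIR` variable, the `ensure_ssh_connection` wrapper, and the function bodies are all hypothetical. Only the function names `check_ssh_connection` and `cleanup_dead_ssh` come from the documentation.

```shell
#!/usr/bin/env bash
# Sketch only: the directory, option values, and function bodies are
# assumptions made for illustration.

SSH_CONTROL_DIR="${SSH_CONTROL_DIR:-$HOME/.ssh/deploy-sockets}"  # assumed location

# Succeeds only if a trivial remote command really runs over the connection.
# `ssh -O check` only asks whether the master process is running.
check_ssh_connection() {
    local host="$1"
    ssh -o ControlMaster=auto \
        -o ControlPath="$SSH_CONTROL_DIR/%r@%h:%p" \
        -o BatchMode=yes \
        -o ConnectTimeout=10 \
        "$host" true >/dev/null 2>&1
}

# Removes the whole socket directory, then recreates it empty and private.
cleanup_dead_ssh() {
    rm -rf -- "${SSH_CONTROL_DIR:?SSH_CONTROL_DIR must be set}"
    mkdir -p -- "$SSH_CONTROL_DIR"
    chmod 700 -- "$SSH_CONTROL_DIR"
}

# Hypothetical wrapper: validate first, reset the directory when validation fails.
ensure_ssh_connection() {
    local host="$1"
    check_ssh_connection "$host" || cleanup_dead_ssh
}
```

The `:?` guard makes `cleanup_dead_ssh` abort instead of running `rm -rf` on an empty path if the variable is ever unset.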
## Corrections Applied

### 1. Enhanced SSH Connection Management
@@ -56,12 +76,17 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
- Enhanced `ssh_exec()` function with:
  - Connection validation before each command
  - Automatic cleanup of dead connections
  - Retry mechanism (up to 3 attempts)
- Additional SSH options for better connection stability:
  - `ConnectTimeout=10`: Fail fast if the connection cannot be established
  - `ServerAliveInterval=60`: Keep the connection alive
  - `ServerAliveCountMax=3`: Detect dead connections quickly

**Latest improvements (root cause fix)**:
- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
- `cleanup_dead_ssh()` now removes the entire socket directory instead of just the socket file
- This ensures the socket is truly removed even if a process is still holding it
- The directory is recreated after cleanup for fresh connections
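As a rough illustration of the retry mechanism and the stability options listed above, a wrapper along these lines could be used. It is a sketch under assumptions, not the real `ssh_exec()`: the variables `SSH_BIN`, `SSH_MAX_ATTEMPTS`, `SSH_RETRY_DELAY`, and `SSH_OPTS` are hypothetical, and the validation and cleanup steps are left out to keep the example short.

```shell
#!/usr/bin/env bash
# Sketch only: variable names and the retry policy are assumptions.

SSH_BIN="${SSH_BIN:-ssh}"                  # hypothetical override, handy for dry runs
SSH_MAX_ATTEMPTS="${SSH_MAX_ATTEMPTS:-3}"  # "up to 3 attempts"
SSH_OPTS=(
    -o ConnectTimeout=10        # fail fast if the connection cannot be established
    -o ServerAliveInterval=60   # send a keepalive probe every 60 seconds
    -o ServerAliveCountMax=3    # give up after 3 unanswered probes
)

ssh_exec() {
    local host="$1"
    shift
    local attempt status=0
    for (( attempt = 1; attempt <= SSH_MAX_ATTEMPTS; attempt++ )); do
        status=0
        "$SSH_BIN" "${SSH_OPTS[@]}" "$host" "$@" || status=$?
        # ssh exits with 255 on its own errors; any other status is the
        # remote command's result, so it is returned without retrying.
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
        echo "ssh_exec: attempt $attempt/$SSH_MAX_ATTEMPTS failed" >&2
        sleep "${SSH_RETRY_DELAY:-1}"
    done
    return "$status"
}
```

Retrying only on status 255 is a design choice of this sketch: it keeps a genuine remote command failure from being repeated three times.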
### 2. Improved Error Handling at Step 5

**File**: `deploy.sh`