Update SSH connection errors documentation with root cause analysis
**Motivations:**
- Document the deep root cause of "ControlSocket already exists" errors
- Explain why ControlMaster=auto doesn't remove invalid sockets
- Document the solution to aggressively clean up dead sockets

**Root causes:**
- Documentation needed to explain the root cause of socket invalidation
- Need to document why directory removal is necessary for proper cleanup

**Fixes:**
- Added deep root cause analysis section explaining ControlMaster=auto behavior
- Documented why dead sockets accumulate and cause subsequent failures
- Updated corrections section with latest improvements

**Enhancements:**
- More complete documentation of the problem and solution

**Affected pages:**
- fixKnowledge/ssh-connection-errors-deployment.md: Added root cause analysis and updated corrections
parent 01110cd322
commit f02b3938a1
@@ -44,6 +44,26 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple
- Command substitution `$()` with SSH commands can block indefinitely if the connection hangs
- The original logic used `grep` on command output, which required capturing all output and could hang

### Deep Root Cause: ControlSocket Already Exists

The fundamental issue is that:

1. **Socket can become invalid between check and execution**: The socket file can exist on the filesystem, but the underlying connection can die between the validation check and the actual command execution.

2. **ControlMaster=auto behavior**: When SSH detects that a ControlMaster socket exists but is invalid, it does NOT remove the socket. Instead, it:
   - Disables multiplexing for that command
   - Creates a new connection without multiplexing
   - Leaves the dead socket file in place

3. **Dead socket accumulation**: The dead socket remains and causes subsequent commands to fail with "ControlSocket already exists, disabling multiplexing" errors.

4. **Insufficient cleanup**: Previous cleanup attempts only removed the socket file, but if a process was still holding it, the socket could remain. Removing the entire directory ensures complete cleanup.

The solution is to:
- Test actual connection usability (not just socket existence)
- Aggressively remove the entire socket directory when invalid
- Recreate the directory for fresh connections
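
The cleanup and recreation steps in this list can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual `cleanup_dead_ssh()` from `deploy.sh`: the function name `reset_socket_dir` and the `SSH_SOCKET_DIR` default are hypothetical, since the source does not show the real socket path.

```shell
#!/usr/bin/env bash
# Hypothetical socket directory; the real ControlPath location is not shown in the source.
SSH_SOCKET_DIR="${SSH_SOCKET_DIR:-${TMPDIR:-/tmp}/ssh-sockets}"

# Remove the whole socket directory, then recreate it empty for fresh connections.
reset_socket_dir() {
    local dir="$1"
    # Guard against an empty or root path so a bad variable cannot trigger a destructive rm.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "refusing to remove '$dir'" >&2
        return 1
    fi
    rm -rf -- "$dir" || return 1
    mkdir -p -- "$dir" || return 1
    # Keep the directory private to the current user.
    chmod 700 "$dir"
}
```

Calling `reset_socket_dir "$SSH_SOCKET_DIR"` before opening a new master connection would give the next `ssh` call an empty directory. Any live master whose socket sits in the same directory would lose its socket file too, which is the trade-off of cleaning up this aggressively.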

## Corrections Applied

### 1. Enhanced SSH Connection Management

@@ -56,12 +76,17 @@ The SSH ControlMaster feature creates a persistent connection to avoid multiple

- Enhanced `ssh_exec()` function with:
- Connection validation before each command
- Automatic cleanup of dead connections
- Retry mechanism (up to 3 attempts)
- Additional SSH options for better connection stability:
- `ConnectTimeout=10`: Fail fast if connection cannot be established
- `ServerAliveInterval=60`: Keep connection alive
- `ServerAliveCountMax=3`: Detect dead connections quickly
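
One way to apply the three options above to every call is a small wrapper. This is an illustrative sketch: `ssh_with_stability_opts` and the `SSH_CMD` override are hypothetical names, not taken from `deploy.sh`. With `ServerAliveInterval=60` and `ServerAliveCountMax=3`, an unresponsive server is detected after roughly 180 seconds of unanswered keepalives (3 × 60 s).

```shell
#!/usr/bin/env bash
# SSH_CMD is a hypothetical override so the wrapper can be tried without a real host.
SSH_CMD="${SSH_CMD:-ssh}"

# Run ssh with the stability options listed above; all arguments are passed through.
ssh_with_stability_opts() {
    "$SSH_CMD" \
        -o ConnectTimeout=10 \
        -o ServerAliveInterval=60 \
        -o ServerAliveCountMax=3 \
        "$@"
}

# Example call (the host name is a placeholder):
#   ssh_with_stability_opts deploy@example.com uptime
```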

**Latest improvements (root cause fix)**:
- `check_ssh_connection()` now tests actual connection usability with `ssh ... true` instead of just `ssh -O check`
- `cleanup_dead_ssh()` now removes entire socket directory instead of just socket file
- This ensures socket is truly removed even if process is still holding it
- Directory is recreated after cleanup for fresh connections
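
The usability test can be sketched as follows. `ssh -O check` only asks whether the master process is running, whereas running `true` on the remote host exercises the whole path to the server. The function name `connection_usable`, the `SSH_CMD` override, and the `BatchMode=yes` option are assumptions for illustration; the real `check_ssh_connection()` is not shown in the source.

```shell
#!/usr/bin/env bash
# SSH_CMD is a hypothetical override so the sketch can be tried without a real host.
SSH_CMD="${SSH_CMD:-ssh}"

# Succeeds only if a trivial command actually runs on the remote side.
connection_usable() {
    local target="$1"
    # BatchMode=yes avoids blocking on a password prompt (an added assumption);
    # ConnectTimeout=10 bounds the wait when a new connection has to be opened.
    "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=10 "$target" true >/dev/null 2>&1
}

# Example call (the host name is a placeholder):
#   if ! connection_usable deploy@example.com; then
#       echo "connection unusable, socket cleanup needed" >&2
#   fi
```

Redirecting output and relying only on the exit status avoids capturing output with `$()`, so the check does not depend on parsing anything the remote side prints.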

### 2. Improved Error Handling at Step 5

**File**: `deploy.sh`