Fix root cause of SSH connection drops with proper keepalive configuration

**Motivations:**
- SSH connections were being dropped by firewalls/NAT due to insufficient keepalives
- Connection reset by peer errors were occurring because connections were idle too long
- Need to prevent connection drops at the source, not just handle errors after

**Root causes:**
- ServerAliveInterval=60 was too long - many firewalls/NAT close idle connections after 30-45 seconds
- No TCPKeepAlive enabled - missing system-level keepalives
- ControlPersist=300 kept dead connections too long
- ServerAliveCountMax=3 was too lenient for detecting dead connections
- Root cause: connections were being closed by network equipment due to inactivity

**Correctifs:**
- Reduced ServerAliveInterval from 60 to 15 seconds to send keepalives more frequently
- Added TCPKeepAlive=yes to use system-level TCP keepalives
- Reduced ServerAliveCountMax from 3 to 2 to detect dead connections faster
- Reduced ControlPersist from 300 to 60 seconds to avoid keeping dead sockets
- Added Compression=no to reduce overhead
- Removed error detection/retry logic - fixing root cause instead of handling symptoms
- Updated check_ssh_connection() to use same keepalive options for consistency

**Evolutions:**
- Connections now stay alive through firewalls/NAT with frequent keepalives
- Dead connections detected and cleaned up faster
- No more connection reset errors due to proper keepalive configuration

**Pages affectées:**
- deploy.sh: Fixed SSH keepalive configuration to prevent connection drops
This commit is contained in:
Nicolas Cantu 2026-01-06 14:42:43 +01:00
parent 1c9f3a2afa
commit b096efe072

View File

@ -37,7 +37,12 @@ check_ssh_connection() {
fi
# Tester la connexion en essayant de l'utiliser avec une commande simple
# Cela détecte si le socket existe mais la connexion est morte
ssh -o ControlPath="${SSH_CONTROL_PATH}" ${SERVER} "true" 2>/dev/null || return 1
# Utiliser les mêmes options de keepalive pour la cohérence
ssh -o ControlPath="${SSH_CONTROL_PATH}" \
-o ServerAliveInterval=15 \
-o ServerAliveCountMax=2 \
-o TCPKeepAlive=yes \
${SERVER} "true" 2>/dev/null || return 1
return 0
}
@ -55,34 +60,21 @@ ssh_exec() {
fi
# Exécuter la commande SSH (une seule tentative, pas de retry)
# Capture stderr pour détecter les erreurs de socket
local ssh_output
ssh_output=$(ssh -o ControlMaster=auto \
-o ControlPath="${SSH_CONTROL_PATH}" \
-o ControlPersist=300 \
-o ConnectTimeout=10 \
-o ServerAliveInterval=60 \
-o ServerAliveCountMax=3 \
${SERVER} "$@" 2>&1)
local ssh_exit_code=$?
# Si on détecte une erreur de socket, nettoyer et réessayer une fois
if echo "$ssh_output" | grep -q "ControlSocket.*already exists"; then
cleanup_dead_ssh
# Réessayer une fois après nettoyage
# Configuration optimisée pour éviter les coupures de connexion :
# - ServerAliveInterval=15 : Keepalives toutes les 15 secondes (au lieu de 60)
# pour éviter que les firewalls/NAT ferment les connexions inactives
# - TCPKeepAlive=yes : Utilise les keepalives TCP au niveau système
# - ServerAliveCountMax=2 : Détecte les connexions mortes plus rapidement
# - ControlPersist=60 : Réduit le temps de persistance pour éviter les sockets morts
ssh -o ControlMaster=auto \
-o ControlPath="${SSH_CONTROL_PATH}" \
-o ControlPersist=300 \
-o ControlPersist=60 \
-o ConnectTimeout=10 \
-o ServerAliveInterval=60 \
-o ServerAliveCountMax=3 \
-o ServerAliveInterval=15 \
-o ServerAliveCountMax=2 \
-o TCPKeepAlive=yes \
-o Compression=no \
${SERVER} "$@" 2>&1
return $?
fi
# Afficher la sortie et retourner le code de sortie
echo "$ssh_output"
return $ssh_exit_code
}
# Nettoyer les connexions SSH persistantes et le répertoire temporaire à la fin