# Diagnosis: Loki Unhealthy - Causes and Solutions

## 🔍 Problem Analysis

### **Observed Symptoms**
- Loki starts and runs (logs look normal)
- The `/ready` endpoint returns "ready" from inside the container
- The external healthcheck returns HTTP 503 "Service Unavailable"
- Error message: "Ingester not ready: waiting for 15s after being ready"
- The Docker healthcheck marks the service as "unhealthy"

### **Identified Root Cause**
Loki enforces a **15-second wait** after the ingester becomes ready before the `/ready` endpoint returns HTTP 200. During that window it returns HTTP 503. The wait corresponds to the ingester's `min_ready_duration` setting, which defaults to 15s.
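
To watch the 503-to-200 transition on `/ready`, a small polling loop can help. This is a minimal sketch, assuming `curl` is installed on the host and Loki is published on `localhost:3100`; the `watch_ready` function name and the `LOKI_URL` variable are illustrative and not part of the original setup.

```bash
# Poll Loki's /ready endpoint once per second and print the HTTP status code.
# LOKI_URL is an assumed default; adjust it to wherever Loki is reachable.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

watch_ready() {
  max_tries="${1:-60}"   # give up after this many attempts
  i=1
  while [ "$i" -le "$max_tries" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$LOKI_URL/ready")
    echo "try $i: HTTP $code"
    if [ "$code" = "200" ]; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Calling `watch_ready 90` right after `docker restart loki` should show a run of non-200 codes followed by a single 200, which gives a direct measurement of how long the container actually needs.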

## 🚨 Possible Reasons for Loki Being Unhealthy

### **1. Insufficient Startup Delay (MAIN CAUSE)**
```yaml
# Current healthcheck configuration
start_period: 60s
interval: 30s
timeout: 10s
retries: 3
```

**Problem**: Loki takes more than 60 seconds to become fully ready

**Solution**: Increase `start_period`
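
To compare healthcheck settings, it helps to estimate how long a container can keep failing before Docker reports it as "unhealthy". The helper below is a rough sketch: it assumes failures during `start_period` are not counted and that `retries` consecutive failures, one per `interval`, are then required. It ignores the duration of each check and the `timeout`, and the function name `unhealthy_after` is illustrative.

```bash
# Rough estimate, in seconds, of the earliest point at which Docker can
# report "unhealthy" when every check fails.
# Arguments: start_period, interval, retries (all plain integers, in seconds).
unhealthy_after() {
  start_period="$1"
  interval="$2"
  retries="$3"
  echo $((start_period + retries * interval))
}
```

Under these assumptions, `unhealthy_after 60 30 3` gives 150 seconds for the current settings, while `unhealthy_after 120 30 5` gives 270 seconds for a `start_period` of 120s with 5 retries.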

### **2. Incomplete Loki Setup**
- Missing or incorrect configuration file
- Undefined environment variables
- Incorrect permissions on the volumes

### **3. Insufficient System Resources**
- Not enough memory for Loki
- Overloaded CPU
- Not enough disk space

### **4. Network Problem**
- Port 3100 blocked or in conflict
- Incorrect Docker network configuration
- Firewall blocking connections

### **5. Incorrect Healthcheck Configuration**
- Timeout too short (10s)
- Interval too short (30s)
- Too few retries (3)

### **6. Loki Configuration Problem**
- Default configuration not suited to the deployment
- Incorrect storage settings
- Component configuration (ingester, distributor, etc.)

## 🔧 Proposed Solutions

### **Solution 1: Increase the Start Period (RECOMMENDED)**
```yaml
loki:
  healthcheck:
    # Assumes curl is available in the image; if it is not, use the wget variant in Solution 2
    test: ["CMD", "sh", "-c", "if curl -f http://localhost:3100/ready >/dev/null 2>&1; then echo 'Loki ready: Log aggregation service responding'; exit 0; else echo 'Loki starting: Log aggregation service not yet ready'; exit 1; fi"]
    interval: 30s
    timeout: 15s          # Raised from 10s to 15s
    retries: 5            # Raised from 3 to 5
    start_period: 120s    # Raised from 60s to 120s
```

### **Solution 2: Alternative Healthcheck**
```yaml
loki:
  healthcheck:
    test: ["CMD", "sh", "-c", "if wget -q --spider http://localhost:3100/ready; then echo 'Loki ready: Log aggregation service responding'; exit 0; else echo 'Loki starting: Log aggregation service not yet ready'; exit 1; fi"]
    interval: 30s
    timeout: 15s
    retries: 5
    start_period: 120s
```

### **Solution 3: Simplified Healthcheck**
```yaml
loki:
  healthcheck:
    test: ["CMD", "sh", "-c", "wget -q --spider http://localhost:3100/ready"]
    interval: 30s
    timeout: 15s
    retries: 5
    start_period: 120s
```

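### **Solution 4: Optimized Loki Configuration**
```yaml
loki:
  # -ingester.min-ready-duration shortens the wait after the ingester reports ready (default 15s).
  # It replaces the LOKI_READY_DELAY environment variable, which Loki does not read on its own.
  command: -config.file=/etc/loki/local-config.yaml -server.http-listen-port=3100 -server.grpc-listen-port=9096 -ingester.min-ready-duration=5s
```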

## 🧪 Diagnostic Tests

### **Test 1: Check the Configuration**
```bash
# Print the Loki configuration file
docker exec loki cat /etc/loki/local-config.yaml
```

### **Test 2: Check Resources**
```bash
# Show live resource usage for the container
docker stats loki
```

### **Test 3: Check Detailed Logs**
```bash
# Show the last 100 log lines
docker logs loki --tail 100
```

### **Test 4: Connectivity Test**
```bash
# Test from the host
curl -v http://localhost:3100/ready

# Test from inside the container
docker exec loki wget -q -O- http://localhost:3100/ready
```

### **Test 5: Check the Volumes**
```bash
# Check ownership and permissions on the data volume
docker exec loki ls -la /loki
```

## 📊 Recommended Configuration

### **Optimized Healthcheck**
```yaml
loki:
  image: grafana/loki:latest
  container_name: loki
  ports:
    - "0.0.0.0:3100:3100"
  volumes:
    - loki_data:/loki
  command: -config.file=/etc/loki/local-config.yaml
  networks:
    btcnet:
      aliases:
        - loki
  healthcheck:
    test: ["CMD", "sh", "-c", "if wget -q --spider http://localhost:3100/ready; then echo 'Loki ready: Log aggregation service responding'; exit 0; else echo 'Loki starting: Log aggregation service not yet ready'; exit 1; fi"]
    interval: 30s
    timeout: 15s
    retries: 5
    start_period: 120s
  restart: unless-stopped
```

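### **Environment Variables**
Loki does not read `LOKI_READY_DELAY` or `LOKI_LOG_LEVEL` from the environment on its own. The equivalent settings are passed as command-line flags, or referenced from the config file when Loki runs with `-config.expand-env=true`:
```yaml
loki:
  # Flags equivalent to the intended LOKI_READY_DELAY=5s and LOKI_LOG_LEVEL=info
  command: -config.file=/etc/loki/local-config.yaml -ingester.min-ready-duration=5s -log.level=info
```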

## 🎯 Action Plan

### **Step 1: Immediate Diagnosis**
1. Check the current configuration
2. Analyze the detailed logs
3. Test connectivity

### **Step 2: Apply the Fixes**
1. Increase `start_period` to 120s
2. Increase `timeout` to 15s
3. Increase `retries` to 5

### **Step 3: Test and Validate**
1. Restart Loki
2. Monitor the healthcheck
3. Check the final status
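
The monitoring step can be scripted by polling the health status that Docker records for the container. This is a minimal sketch, assuming the container is named `loki` and has a healthcheck defined; the `wait_healthy` function name and the 10-second polling interval are illustrative choices.

```bash
# Poll Docker's recorded health status until it is "healthy" or "unhealthy".
# Arguments: container name (default: loki), maximum number of checks (default: 30).
wait_healthy() {
  name="${1:-loki}"
  max_tries="${2:-30}"
  i=1
  while [ "$i" -le "$max_tries" ]; do
    # .State.Health.Status is "starting", "healthy" or "unhealthy"
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    echo "check $i: ${status:-unknown}"
    case "$status" in
      healthy) return 0 ;;
      unhealthy) return 1 ;;
    esac
    sleep 10
    i=$((i + 1))
  done
  return 1
}
```

A typical run would be `docker restart loki && wait_healthy loki 30`. If the function returns non-zero, `docker inspect --format '{{json .State.Health}}' loki` shows the output of the most recent checks.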

### **Step 4: Ongoing Optimization**
1. Adjust the parameters if needed
2. Document the improvements
3. Update the configuration

## 🔍 Points to Watch

### **Warning Signs**
- Healthcheck failing constantly
- Error logs in Loki
- High system resource usage
- Frequent timeouts

### **Success Indicators**
- Healthcheck stays "healthy"
- The `/ready` endpoint returns HTTP 200
- Loki logs look normal
- Acceptable performance

---

**Document created on 2025-09-21**
**Version**: 1.0
**Diagnosis**: Loki Unhealthy Analysis