algo/docs/fixKnowledge/crash_paliers_finale_f16_oom.md
ncantu f05f2380ff Collatz: pipelines, paliers scripts, docs and fixKnowledge
**Motivations:**
- Preserve the state of the Collatz k scripts, pipelines and proof
- Document the D18/D21 diagnostic, errata, proof plan and the paliers OOM fix

**Root causes:**
- Excessive memory consumption (OOM) in the paliers finale f16 script

**Fixes:**
- Documentation of the paliers finale f16 OOM crash and candidate fixes

**Changes:**
- Changes to the fusion/k pipelines, noyau recover/update, and the 08-paliers-finale script
- Added docs (diagnostic, errata, lemma plan, fixKnowledge OOM)

**Affected pages:**
- applications/collatz/collatz_k_scripts/*.py, note.md, requirements.txt
- applications/collatz/collatz_k_scripts/*.md (diagnostic, errata, plan)
- applications/collatz/scripts/08-paliers-finale.sh, README.md
- docs/fixKnowledge/crash_paliers_finale_f16_oom.md
2026-03-04 17:19:50 +01:00

Crash 08-paliers-finale / Cursor during F16

Problem

The script 08-paliers-finale.sh (extended pipeline D18→D21, F15, F16) crashes, and Cursor (which launched it) also crashes. No Python exception is logged; the last line in out/pipeline_extend.log is:

[2026-03-04 09:26:35] STEP start F16 fusion palier=2^35 rss_max_mb=11789

Root cause

  1. Where it stops: The process is killed during F16 (fusion pipeline, palier 2^35), right after D20 completed successfully.

  2. Why there is no [CRASH] line: The Python excepthook only runs on uncaught exceptions. The process was almost certainly killed by the Linux OOM killer (SIGKILL) when the system ran out of RAM. SIGKILL cannot be caught; the process disappears without running exception handlers.

  3. Memory sequence:

    • After D20: rss_max_mb=11789 (~11.8 GB) with noyau_post_D20.json written (156 M residues, 1.77 GB on disk).
    • F16 starts and loads noyau_post_D20.json. An initial fix used stream load (ijson) with --modulo 9 so only residues with r % 9 == 0 are kept (~17 M residues). That still allocates a single list of ~17 M Python integers (on the order of several GB), so OOM can still occur on a 16 GB machine when combined with the rest of the process and Cursor.
    • A second fix uses chunked stream load: the noyau is streamed in chunks (e.g. 1.5 M residues per chunk); each chunk is passed to build_fusion_clauses() and only the output rows are accumulated. No single list of all filtered residues is ever built, so peak RSS stays bounded.
  4. Why Cursor crashes: Cursor and the pipeline share the same machine RAM. When the pipeline's memory spikes during the F16 load, one of three things happens: the Python process is killed (Cursor stays up but the run “crashes”), the system is so starved that the OOM killer also kills Cursor, or the machine becomes unresponsive and Cursor appears to crash.
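
The missing [CRASH] line can be reproduced in isolation. The sketch below is not the pipeline's code: the child script, its hook and the run_child helper are hypothetical, and it assumes a POSIX system where SIGKILL exists. It starts a child that installs an excepthook, then either lets it raise or kills it from outside, the way the OOM killer would.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child script: installs an excepthook that prints a [CRASH]
# line, then either raises or sleeps until it is killed from outside.
CHILD = textwrap.dedent("""
    import sys, time

    def hook(exc_type, exc, tb):
        print("[CRASH]", exc_type.__name__, flush=True)

    sys.excepthook = hook
    print("ready", flush=True)
    if sys.argv[1] == "raise":
        raise RuntimeError("boom")  # uncaught exception: the hook runs
    time.sleep(60)                  # wait for the external SIGKILL
""")


def run_child(mode):
    """Run the child in 'raise' or 'kill' mode; return (exit_code, output)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, mode],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait for "ready": the hook is now installed
    if mode == "kill":
        proc.send_signal(signal.SIGKILL)  # cannot be caught by the child
    output = proc.stdout.read()
    return proc.wait(), output


if __name__ == "__main__":
    print(run_child("raise"))  # the hook prints its [CRASH] line
    print(run_child("kill"))   # negative exit code, no [CRASH] line
```

A negative exit code equal to minus the signal number is how subprocess reports death by signal, so a wrapper script that records the exit status can distinguish a kill from a Python exception even when the log is silent.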

Corrective actions

  • Run the extended pipeline outside Cursor: Use a standalone terminal (or an SSH session, or nohup in a separate terminal) so the pipeline is not a child of the Cursor process. Example:
    • From a separate terminal: cd /home/ncantu/code/algo/applications/collatz && ./scripts/08-paliers-finale.sh
    • Or: nohup ./scripts/08-paliers-finale.sh > out/run.log 2>&1 &
  • Ensure enough free RAM before F16 (e.g. 20+ GB free, or close other heavy apps) if running on the same machine as Cursor.
  • Resume from D20 if D18–D20 are already done: RESUME_FROM=D20 ./scripts/08-paliers-finale.sh still loads noyau_post_F15, then runs D20, then F16. Skipping straight to F16 would need a new option (e.g. RESUME_FROM=F16) with noyau_post_D20 already present; this is not currently implemented.
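
The free-RAM check before F16 could be automated. The following is a minimal sketch, assuming a Linux host that exposes MemAvailable in /proc/meminfo; the function names and the 20 GB default threshold (taken from the guideline above) are illustrative and not part of the pipeline.

```python
import re


def mem_available_gb(meminfo_text):
    """Return the MemAvailable value of /proc/meminfo content, in GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo text")
    # /proc/meminfo reports values in KiB (labelled "kB").
    return int(match.group(1)) / (1024 * 1024)


def enough_ram(meminfo_text, required_gb=20.0):
    """True when available memory meets the (assumed) F16 threshold."""
    return mem_available_gb(meminfo_text) >= required_gb


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

A launcher could call enough_ram(read_meminfo()) just before the F16 step and refuse to start when it returns False, instead of letting the OOM killer decide.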

Impact

  • D18, D19, F15, D20 complete successfully; artefacts are in out/noyaux/ and out/candidats/.
  • F16 and D21 never run; Cursor can crash when the pipeline is started from inside Cursor on a RAM-limited machine.

Analysis modalities

  • Inspect last lines: tail -30 out/pipeline_extend.log.
  • Check for OOM in kernel logs: dmesg | grep -i out.of.memory or journalctl -k -b | grep -i oom (if available).
  • Monitor RSS during run: watch -n 5 'ps -o rss= -p $(pgrep -f "collatz_k_pipeline")' (RSS in KB).
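
Peak RSS can also be sampled from inside the process. This sketch shows one way a value like the rss_max_mb field in the log line could be obtained, using the standard resource module (Unix only); whether the pipeline computes it this way is an assumption, and the helper names are hypothetical.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def step_line(step):
    """Format a log fragment in the style of the pipeline log (illustrative)."""
    return f"STEP start {step} rss_max_mb={peak_rss_mb():.0f}"
```

Because ru_maxrss is a high-water mark, logging it at the start of each step shows which earlier step drove memory up, which is how the 11789 MB figure after D20 can be read.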

Deployment

Run the script outside the Cursor process so that memory pressure does not kill Cursor. Code fix (four steps):

  1. Stream load (already in place)
    When the noyau file is >500 MB and --modulo is set, the fusion pipeline uses ijson to stream-parse the JSON and keep only residues with r % modulo == 0, instead of loading the full file with json.loads(). Install: pip3 install -r collatz_k_scripts/requirements.txt.

  2. Chunked processing (added after OOM persisted)
    For noyau files >500 MB with modulo set, the pipeline no longer builds a single list of all filtered residues. It uses _stream_load_noyau_modulo_chunked() to yield chunks (default 800k residues). For each chunk it runs build_fusion_clauses(), then appends the rows to the output CSV. Peak memory stays bounded by one chunk plus the audit state maps and the merged rows. F16 with noyau_post_D20.json (~1.7 GB, modulo 9) now completes and writes the fusion CSV.

  3. run_update_noyau stream path (post-F16 OOM)
    After F16, the pipeline calls run_update_noyau(cert_f16, noyau_post_D20, noyau_post_F16). That step was loading the full noyau_post_D20.json (1.7 GB, 156 M residues) with read_text() + json.loads(), causing OOM. For previous-noyau files >500 MB, run_update_noyau now uses _get_palier_from_tail() (read last 128 bytes to get palier) and _stream_update_noyau(): stream-parse the noyau with ijson, keep only residues not in the covered set (from the cert), and stream-write the new noyau JSON. No full noyau list is ever materialized.

  4. run_single_palier stream path (D21 OOM)
    D21 loads noyau_post_F16.json (~1.7 GB, ~156 M residues). Loading it fully in run_single_palier caused OOM. For input noyau files >500 MB, run_single_palier now uses _run_single_palier_stream: (1) stream pass to compute max_r and count; (2) stream pass to build cand and cover sets; (3) write CSV from cand; (4) stream pass to write residual noyau (only cover set in memory, residual written incrementally). No full residue list or full residual list is materialized.
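
The chunked pattern behind steps 1 and 2 can be sketched without the pipeline's own helpers. In the example below, iter_filtered_chunks and process_in_chunks are hypothetical names, the residue source is any iterable (in the pipeline it would be the ijson stream), and build_rows stands in for build_fusion_clauses(), whose real signature is not shown in this note.

```python
import csv
import io
from itertools import islice


def iter_filtered_chunks(residues, modulo, chunk_size):
    """Yield lists of at most chunk_size residues r with r % modulo == 0."""
    filtered = (r for r in residues if r % modulo == 0)  # lazy, no full list
    while True:
        chunk = list(islice(filtered, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(residues, modulo, chunk_size, build_rows, out):
    """Run build_rows on each chunk and append its rows to the CSV stream."""
    writer = csv.writer(out)
    total = 0
    for chunk in iter_filtered_chunks(residues, modulo, chunk_size):
        rows = build_rows(chunk)  # only one chunk is alive at a time
        writer.writerows(rows)
        total += len(rows)
    return total


if __name__ == "__main__":
    # Toy data: odd numbers below 100, keep multiples of 9, chunks of 4.
    buffer = io.StringIO()
    count = process_in_chunks(
        range(1, 100, 2), 9, 4,
        lambda chunk: [(r, 3 * r + 1) for r in chunk],
        buffer,
    )
    print(count)
    print(buffer.getvalue())
```

The same shape applies to the stream paths in steps 3 and 4: each pass consumes the residue stream once and keeps only a bounded structure (a chunk, the covered set, or the cand and cover sets) in memory.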