**Motivations:**
- Preserve the state of the Collatz k scripts, pipelines, and demonstration
- Document the D18/D21 diagnostic, errata, proof plan, and the paliers OOM fix

**Root causes:**
- Excessive memory consumption (OOM) in the paliers finale f16 script

**Fixes:**
- Documentation of the paliers finale f16 OOM crash and candidate corrections

**Evolutions:**
- Evolutions of the fusion/k pipelines, kernel recover/update, and the 08-paliers-finale script
- Added docs (diagnostic, errata, lemma plan, fixKnowledge OOM)

**Affected pages:**
- applications/collatz/collatz_k_scripts/*.py, note.md, requirements.txt
- applications/collatz/collatz_k_scripts/*.md (diagnostic, errata, plan)
- applications/collatz/scripts/08-paliers-finale.sh, README.md
- docs/fixKnowledge/crash_paliers_finale_f16_oom.md
# Crash 08-paliers-finale / Cursor during F16
## Problem
The script `08-paliers-finale.sh` (extended pipeline D18→D21, F15, F16) crashes, and Cursor (which launched it) crashes as well. No Python exception is logged; the last line in `out/pipeline_extend.log` is:

```
[2026-03-04 09:26:35] STEP start F16 fusion palier=2^35 rss_max_mb=11789
```
## Root cause

- **Where it stops:** the process is killed during F16 (fusion pipeline, palier 2^35), right after D20 completed successfully.
- **Why there is no `[CRASH]` line:** the Python excepthook only runs on uncaught exceptions. The process was almost certainly killed by the Linux OOM killer (SIGKILL) when the system ran out of RAM. SIGKILL cannot be caught; the process disappears without running any exception handler.
- **Memory sequence:**
  - After D20: `rss_max_mb=11789` (~11.8 GB), with `noyau_post_D20.json` written (156 M residues, 1.77 GB on disk).
  - F16 starts and loads `noyau_post_D20.json`. An initial fix used a stream load (ijson) with `--modulo 9`, so only residues with `r % 9 == 0` are kept (~17 M residues). That still allocates a single list of ~17 M Python integers (on the order of several GB), so OOM can still occur on a 16 GB machine when combined with the rest of the process and Cursor.
  - A second fix uses a chunked stream load: the noyau is streamed in chunks (e.g. 1.5 M residues per chunk); each chunk is passed to `build_fusion_clauses()` and only the output rows are accumulated. No single list of all filtered residues is ever built, so peak RSS stays bounded.
- **Why Cursor crashes:** Cursor and the pipeline share the same machine RAM. When the pipeline's memory spikes during the F16 load, either the Python process is killed (and Cursor stays up but the run "crashes"), or the system is so starved that the OOM killer also kills Cursor, or the machine becomes unresponsive and Cursor appears to crash.
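The absence of a `[CRASH]` line can be reproduced in isolation: a process killed by SIGKILL exits with a negative return code and never runs `sys.excepthook`. A minimal sketch, not part of the pipeline itself (the child script body here is illustrative, and the `-9` code is Linux behaviour):

```python
import signal
import subprocess
import sys

# The child installs an excepthook that would print "[CRASH]", then SIGKILLs
# itself -- mimicking the OOM killer. The hook never fires, nothing is printed.
child_code = r"""
import os, signal, sys
sys.excepthook = lambda *a: print("[CRASH]", a[0].__name__)
os.kill(os.getpid(), signal.SIGKILL)
"""

proc = subprocess.run([sys.executable, "-c", child_code],
                      capture_output=True, text=True)
print(proc.returncode)    # negative signal number on Linux: -9 for SIGKILL
print(repr(proc.stdout))  # '' -- the excepthook never ran
```

This is exactly why the pipeline's log simply stops at the `STEP start F16` line: there is no point at which Python can write a crash record.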
## Corrective actions

- **Run the extended pipeline outside Cursor:** use a standalone terminal (or an SSH session, or `nohup` in a separate terminal) so Cursor does not share the pipeline's memory space. From a separate terminal:
  - `cd /home/ncantu/code/algo/applications/collatz && ./scripts/08-paliers-finale.sh`
  - or: `nohup ./scripts/08-paliers-finale.sh > out/run.log 2>&1 &`
- **Ensure enough free RAM before F16** (e.g. 20+ GB free, or close other heavy apps) if running on the same machine as Cursor.
- **Resume from D20 if D18–D20 are already done:** `RESUME_FROM=D20 ./scripts/08-paliers-finale.sh` still loads `noyau_post_F15`, then runs D20, then F16. To skip straight to F16 you would need a new option (e.g. `RESUME_FROM=F16`) and `noyau_post_D20` already present; this is currently not implemented.
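The "enough free RAM" precondition can be checked programmatically before launching F16. A minimal sketch that reads `MemAvailable` from `/proc/meminfo` (Linux only; the function name and the 20 GiB threshold are illustrative, not part of the pipeline):

```python
def mem_available_gb(meminfo_path="/proc/meminfo"):
    """Return the kernel's estimate of available memory, in GiB (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # /proc/meminfo reports kB
                return kb / (1024 * 1024)
    raise RuntimeError("MemAvailable not found in " + meminfo_path)

if __name__ == "__main__":
    avail = mem_available_gb()
    if avail < 20:
        print(f"WARNING: only {avail:.1f} GiB available; "
              "F16 may trigger the OOM killer")
    else:
        print(f"{avail:.1f} GiB available, OK to start F16")
```

Such a guard could be called at the top of the F16 step to fail fast with a readable message instead of being SIGKILLed mid-run.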
## Impact

- D18, D19, F15, and D20 complete successfully; artefacts are in `out/noyaux/` and `out/candidats/`.
- F16 and D21 never run; Cursor can crash when the pipeline is started from inside Cursor on a RAM-limited machine.
## Analysis modalities

- Inspect the last lines: `tail -30 out/pipeline_extend.log`.
- Check for OOM in kernel logs: `dmesg | grep -i out.of.memory` or `journalctl -k -b | grep -i oom` (if available).
- Monitor RSS during the run: `watch -n 5 'ps -o rss= -p $(pgrep -f "collatz_k_pipeline")'` (RSS in KB).
## Deployment

Run the script outside the Cursor process so that memory pressure does not kill Cursor. Code fix (four steps):

1. **Stream load (already in place).** When the noyau file is >500 MB and `--modulo` is set, the fusion pipeline uses ijson to stream-parse the JSON and keep only residues with `r % modulo == 0`, instead of loading the full file with `json.loads()`. Install: `pip3 install -r collatz_k_scripts/requirements.txt`.
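The idea behind the stream load can be sketched independently of the real pipeline: filter residues on the fly instead of materializing the full list first. In the actual code the residue stream would come from something like `ijson.items(f, "residues.item")`; the generator below takes any iterable so the sketch stays dependency-free (names are illustrative):

```python
def iter_residues_modulo(residues, modulo):
    """Yield only residues with r % modulo == 0, one at a time.

    `residues` is any iterable of ints -- in the real pipeline it would be an
    ijson item stream over the noyau JSON, so the full file is never held in
    memory at once.
    """
    for r in residues:
        if r % modulo == 0:
            yield r

# Nothing is materialized until a consumer asks for items.
filtered = iter_residues_modulo(range(30), 9)
print(list(filtered))  # [0, 9, 18, 27]
```

The limitation described above remains: if the consumer collects the filtered stream into one list (here ~17 M ints), the peak allocation is merely reduced, not bounded.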
2. **Chunked processing (added after the OOM persisted).** For noyau files >500 MB with modulo set, the pipeline no longer builds a single list of all filtered residues. It uses `_stream_load_noyau_modulo_chunked()` to yield chunks (default 800k residues). For each chunk it runs `build_fusion_clauses()`, then appends the rows to the output CSV. Peak memory stays bounded by one chunk plus the audit state maps and the merged rows. F16 with `noyau_post_D20.json` (~1.7 GB, modulo 9) now completes and writes the fusion CSV.
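The chunked variant goes one step further: it groups the filtered stream into fixed-size chunks and processes each chunk immediately, so peak memory is one chunk plus the accumulated output rows. A sketch under the same assumptions (the chunking helper and the `build` callback stand in for `_stream_load_noyau_modulo_chunked()` and `build_fusion_clauses()`):

```python
from itertools import islice

def iter_chunks(stream, chunk_size):
    """Group an iterator into lists of at most chunk_size items."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def process_noyau_chunked(residues, modulo, build, chunk_size=800_000):
    """Filter by modulo, feed each chunk to `build`, and accumulate only
    the output rows -- never the full filtered residue list."""
    filtered = (r for r in residues if r % modulo == 0)
    rows = []
    for chunk in iter_chunks(filtered, chunk_size):
        rows.extend(build(chunk))
    return rows

# Toy run: chunk_size=3 so the chunking is visible; `build` just tags rows.
rows = process_noyau_chunked(range(100), modulo=9, chunk_size=3,
                             build=lambda chunk: [("row", r) for r in chunk])
print(len(rows))  # 12 residues in 0..99 are divisible by 9
```

In the real pipeline the accumulated rows are appended to the output CSV, so even they need not all stay resident.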
3. **`run_update_noyau` stream path (post-F16 OOM).** After F16, the pipeline calls `run_update_noyau(cert_f16, noyau_post_D20, noyau_post_F16)`. That step was loading the full `noyau_post_D20.json` (1.7 GB, 156 M residues) with `read_text()` + `json.loads()`, causing OOM. For previous-noyau files >500 MB, `run_update_noyau` now uses `_get_palier_from_tail()` (read the last 128 bytes to get the palier) and `_stream_update_noyau()`: stream-parse the noyau with ijson, keep only residues not in the covered set (from the cert), and stream-write the new noyau JSON. No full noyau list is ever materialized.
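The streaming update can be sketched as a single pass that drops covered residues and writes survivors incrementally; only the covered set lives in memory. (The function and JSON layout below are illustrative; the real `_stream_update_noyau()` parses and writes the actual noyau format with ijson.)

```python
import io
import json

def stream_update_noyau(residues, covered, out_file, palier):
    """Stream-write a new noyau: keep residues not in `covered`, never
    materializing the full residue list. `residues` is any iterable of ints
    (an ijson stream in the real pipeline); `covered` is a set from the cert."""
    out_file.write('{"residues": [')
    first = True
    kept = 0
    for r in residues:
        if r in covered:
            continue
        out_file.write(("" if first else ", ") + str(r))
        first = False
        kept += 1
    out_file.write('], "palier": %s}' % json.dumps(palier))
    return kept

# Toy run with an in-memory "file".
buf = io.StringIO()
kept = stream_update_noyau(range(10), covered={2, 5, 7},
                           out_file=buf, palier="2^35")
print(kept)                                    # 7 residues survive
print(json.loads(buf.getvalue())["residues"])  # [0, 1, 3, 4, 6, 8, 9]
```

Writing the palier last is also what makes the `_get_palier_from_tail()` trick possible: the value sits in the final bytes of the file, so it can be recovered without parsing the residue array.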
4. **`run_single_palier` stream path (D21 OOM).** D21 loads `noyau_post_F16.json` (~1.7 GB, ~156 M residues). Loading it fully in `run_single_palier` caused OOM. For input noyau files >500 MB, `run_single_palier` now uses `_run_single_palier_stream`: (1) a stream pass to compute `max_r` and the count; (2) a stream pass to build the cand and cover sets; (3) write the CSV from cand; (4) a stream pass to write the residual noyau (only the cover set stays in memory; the residual is written incrementally). No full residue list or full residual list is materialized.
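The multi-pass scheme relies on being able to re-open the residue stream for each pass. A compact sketch, with a factory returning a fresh iterator per pass (all names and the candidate predicate are illustrative, and cand/cover are collapsed into one set for brevity; the real passes re-parse the noyau JSON with ijson):

```python
def single_palier_stream(open_stream, is_cand, write_row, write_residual):
    """Four streaming passes over the noyau; only the cover set is in memory.

    open_stream: callable returning a fresh iterator of residues (one per pass).
    is_cand: predicate selecting candidate residues.
    """
    # Pass 1: max residue and count, no storage.
    max_r, count = 0, 0
    for r in open_stream():
        max_r = max(max_r, r)
        count += 1

    # Pass 2: build the cover set (standing in for both cand and cover).
    cover = set()
    for r in open_stream():
        if is_cand(r):
            cover.add(r)

    # Pass 3: write the candidate CSV rows.
    for r in sorted(cover):
        write_row(r)

    # Pass 4: write the residual noyau incrementally.
    for r in open_stream():
        if r not in cover:
            write_residual(r)
    return max_r, count

rows, residual = [], []
max_r, count = single_palier_stream(
    open_stream=lambda: iter(range(10)),
    is_cand=lambda r: r % 3 == 0,
    write_row=rows.append,
    write_residual=residual.append,
)
print(max_r, count)  # 9 10
print(rows)          # [0, 3, 6, 9]
print(residual)      # [1, 2, 4, 5, 7, 8]
```

The trade-off is deliberate: each extra pass costs another sequential read of the ~1.7 GB file, but keeps peak RSS bounded by the cover set alone.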