**Motivations:**
- Preserve the state of the Collatz k scripts, pipelines and demonstration
- Document the D18/D21 diagnostic, the errata, the proof plan and the paliers OOM fix

**Root causes:**
- Excessive memory consumption (OOM) in the paliers finale f16 script

**Fixes:**
- Documentation of the paliers finale f16 OOM crash and of correction paths

**Evolutions:**
- Changes to the fusion/k pipelines, noyau recover/update, and the 08-paliers-finale script
- New docs (diagnostic, errata, lemma plan, fixKnowledge OOM)

**Affected pages:**
- applications/collatz/collatz_k_scripts/*.py, note.md, requirements.txt
- applications/collatz/collatz_k_scripts/*.md (diagnostic, errata, plan)
- applications/collatz/scripts/08-paliers-finale.sh, README.md
- docs/fixKnowledge/crash_paliers_finale_f16_oom.md
# Crash 08-paliers-finale / Cursor during F16
## Problem
The script `08-paliers-finale.sh` (extended pipeline D18→D21, F15, F16) crashes, and Cursor (which launched it) also crashes. No Python exception is logged; the last line in `out/pipeline_extend.log` is:
```
[2026-03-04 09:26:35] STEP start F16 fusion palier=2^35 rss_max_mb=11789
```
## Root cause
1. **Where it stops**: The process is killed during **F16** (fusion pipeline, palier 2^35), right after D20 completed successfully.
2. **Why there is no `[CRASH]` line**: The Python excepthook only runs on uncaught exceptions. The process was almost certainly killed by the **Linux OOM killer (SIGKILL)** when the system ran out of RAM. SIGKILL cannot be caught; the process disappears without running exception handlers.
3. **Memory sequence**:
   - After D20: **rss_max_mb=11789** (~11.8 GB) with `noyau_post_D20.json` written (156 M residues, 1.77 GB on disk).
   - F16 starts and loads `noyau_post_D20.json`. An initial fix used **stream load** (ijson) with `--modulo 9` so only residues with `r % 9 == 0` are kept (~17 M residues). That still allocates a single list of ~17 M Python integers (on the order of several GB), so **OOM can still occur** on a 16 GB machine when combined with the rest of the process and Cursor.
   - A second fix uses **chunked stream load**: the noyau is streamed in chunks (e.g. 1.5 M residues per chunk); each chunk is passed to `build_fusion_clauses()` and only the output rows are accumulated. No single list of all filtered residues is ever built, so peak RSS stays bounded.
4. **Why Cursor crashes**: Cursor and the pipeline share the same machine's RAM. When the pipeline's memory spikes during the F16 load, one of three things happens: the Python process alone is killed (Cursor stays up but the run "crashes"); the system is starved enough that the OOM killer kills Cursor too; or the machine becomes unresponsive and Cursor merely appears to crash.
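The chunked stream load described in point 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the real fix streams the noyau JSON with ijson, whereas here the residue stream is abstracted as any Python iterator, and the per-chunk call to `build_fusion_clauses()` is replaced by a stand-in.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def iter_chunks(residues: Iterable[int], modulo: int,
                chunk_size: int) -> Iterator[List[int]]:
    """Yield lists of at most chunk_size residues with r % modulo == 0.

    Only one chunk is alive at a time, so peak memory is bounded by
    chunk_size instead of the full residue count.
    """
    filtered = (r for r in residues if r % modulo == 0)
    while True:
        chunk = list(islice(filtered, chunk_size))
        if not chunk:
            return
        yield chunk

# Accumulate only the (small) per-chunk output, never the full input.
rows = []
for chunk in iter_chunks(range(100), modulo=9, chunk_size=4):
    rows.append(sum(chunk))  # stand-in for build_fusion_clauses(chunk)
```

With real data the iterator would be produced by ijson's streaming parser, so the full filtered residue list never exists in memory.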
## Corrective actions
- **Run the extended pipeline outside Cursor**: Use a standalone terminal (or SSH session, or `nohup` in a separate terminal) so Cursor is not in the same memory space. Example:
  - From a separate terminal: `cd /home/ncantu/code/algo/applications/collatz && ./scripts/08-paliers-finale.sh`
  - Or: `nohup ./scripts/08-paliers-finale.sh > out/run.log 2>&1 &`
- **Ensure enough free RAM** before F16 (e.g. 20+ GB free, or close other heavy apps) if running on the same machine as Cursor.
- **Resume from D20** if D18–D20 are already done: `RESUME_FROM=D20 ./scripts/08-paliers-finale.sh` still loads `noyau_post_F15` then runs D20, then F16. To skip straight to F16 you would need a new option (e.g. `RESUME_FROM=F16`) and `noyau_post_D20` already present; currently not implemented.
## Impact
- D18, D19, F15, D20 complete successfully; artefacts are in `out/noyaux/` and `out/candidats/`.
- F16 and D21 never run; Cursor can crash when the pipeline is started from inside Cursor on a RAM-limited machine.
## Analysis modalities
- Inspect last lines: `tail -30 out/pipeline_extend.log`.
- Check for OOM in kernel logs: `dmesg | grep -i out.of.memory` or `journalctl -k -b | grep -i oom` (if available).
- Monitor RSS during run: `watch -n 5 'ps -o rss= -p $(pgrep -f "collatz_k_pipeline")'` (RSS in KB).
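As a complement to the `ps` loop above, RSS can also be read from `/proc/<pid>/status` on Linux; a small sketch (only the standard `VmRSS:` field and its kB unit are assumed):

```python
def parse_vmrss_kb(status_text: str) -> int:
    """Return the VmRSS value in kB from a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: 'VmRSS:\t  12071936 kB'
            return int(line.split()[1])
    raise ValueError("no VmRSS field (kernel thread or fully swapped process)")

# Usage on a live process (pid is hypothetical):
#   with open(f"/proc/{pid}/status") as f:
#       print(parse_vmrss_kb(f.read()) / 1024, "MB")
```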
## Deployment
Run the script outside the Cursor process so that memory pressure does not kill Cursor. Code fix (four steps):
1. **Stream load (already in place)**

   When the noyau file is >500 MB and `--modulo` is set, the fusion pipeline uses `ijson` to stream-parse the JSON and keep only residues with `r % modulo == 0`, instead of loading the full file with `json.loads()`. Install: `pip3 install -r collatz_k_scripts/requirements.txt`.
2. **Chunked processing (added after OOM persisted)**

   For noyau files >500 MB with modulo set, the pipeline no longer builds a single list of all filtered residues. It uses `_stream_load_noyau_modulo_chunked()` to yield chunks (default 800k residues). For each chunk it runs `build_fusion_clauses()`, then appends the rows to the output CSV. Peak memory stays bounded by one chunk plus the audit state maps and the merged rows. F16 with `noyau_post_D20.json` (~1.7 GB, modulo 9) now completes and writes the fusion CSV.
3. **run_update_noyau stream path (post-F16 OOM)**

   After F16, the pipeline calls `run_update_noyau(cert_f16, noyau_post_D20, noyau_post_F16)`. That step was loading the full `noyau_post_D20.json` (1.7 GB, 156 M residues) with `read_text()` + `json.loads()`, causing OOM. For previous-noyau files >500 MB, `run_update_noyau` now uses `_get_palier_from_tail()` (read the last 128 bytes to get the palier) and `_stream_update_noyau()`: stream-parse the noyau with ijson, keep only residues not in the covered set (from the cert), and stream-write the new noyau JSON. No full noyau list is ever materialized.
4. **run_single_palier stream path (D21 OOM)**

   D21 loads `noyau_post_F16.json` (~1.7 GB, ~156 M residues). Loading it fully in `run_single_palier` caused OOM. For input noyau files >500 MB, `run_single_palier` now uses `_run_single_palier_stream`: (1) stream pass to compute max_r and count; (2) stream pass to build cand and cover sets; (3) write CSV from cand; (4) stream pass to write the residual noyau (only the cover set in memory, residual written incrementally). No full residue list or full residual list is materialized.
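The stream-update pattern shared by steps 3 and 4 can be sketched as follows. This is a minimal illustration under assumed names: the real code streams the input with ijson, and the `{"palier": ..., "residus": [...]}` layout is only a plausible shape for the noyau JSON, not a confirmed schema.

```python
import io
import json
from typing import Iterable, Set, TextIO

def stream_update_noyau(residues: Iterable[int], covered: Set[int],
                        palier: int, out: TextIO) -> int:
    """Stream-write a new noyau keeping only residues not in `covered`.

    The output is written incrementally, so no full residue list is
    ever materialized; only the covered set stays in memory.
    Returns the number of residues kept.
    """
    out.write('{"palier": %d, "residus": [' % palier)
    kept = 0
    for r in residues:
        if r in covered:
            continue
        if kept:
            out.write(",")
        out.write(str(r))
        kept += 1
    out.write("]}")
    return kept

# Filter residues 0..9, with {2, 5} covered by the certificate.
buf = io.StringIO()
n = stream_update_noyau(range(10), covered={2, 5}, palier=2**35, out=buf)
```

The same incremental-write idea underlies step 4's residual-noyau pass: the residual is emitted element by element while only the cover set stays resident.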