4NK 14c974f54c Add smart-ide-tools-bridge API for submodule tools + central local env

- New service: tools bridge (port 37147) registry + Carbonyl/PageIndex/Chandra POST jobs
- config/services.local.env.example and gitignore for services.local.env
- .env.example for repos-devtools, regex-search, ia-dev-gateway, orchestrator, claw proxy, langextract
- Orchestrator intents: tools.registry, tools.carbonyl.plan, tools.pageindex.run, tools.chandra.ocr
- Docs: API + repo service fiche, architecture index; do not commit dist/

2026-04-03 22:35:57 +02:00


Chandra OCR (upstream)

Chandra OCR 2 converts images and PDFs to Markdown, HTML, or JSON while preserving layout (tables, forms, handwriting, math). The code is licensed under Apache-2.0; the model weights are covered by a separate license (MODEL_LICENSE in upstream/); see the upstream repository.

This services/chandra/ directory contains:

upstream/: Git submodule pointing to datalab-to/chandra.
install-local-hf.sh: installs the Hugging Face dependencies (Torch, Transformers, etc.) into upstream/.venv.
run-chandra-hf.sh: runs the CLI with --method hf (local inference).
run-chandra.sh: runs chandra as is (pass --method vllm or hf).
.env.example: variables for upstream/local.env (model, GPU, tokens).

Local configuration (Hugging Face)

Once the submodule is present:

cd services/chandra
./install-local-hf.sh
cp .env.example upstream/local.env
# Edit upstream/local.env if needed (TORCH_DEVICE, MAX_OUTPUT_TOKENS, HF_TOKEN).

GPU: leave TORCH_DEVICE empty for device_map="auto" (upstream behavior), or set it explicitly, e.g. TORCH_DEVICE=cuda:0.
CPU: possible but slow; set TORCH_DEVICE=cpu.
The MODEL_CHECKPOINT model is downloaded from Hugging Face on the first run (network connection required; significant disk space).

Upstream recommends flash-attention for better GPU performance; once it is installed, set TORCH_ATTN=flash_attention_2 in local.env.
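To sanity-check a local.env before a long model download, a small script can read the file and report which device setting would apply. The sketch below is only an illustration of the rule described above (empty TORCH_DEVICE means device_map="auto"); the helper names parse_env and describe_device are hypothetical and are not part of Chandra or of the wrapper scripts, and the parser only handles plain KEY=VALUE lines.

```python
from __future__ import annotations


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def describe_device(settings: dict[str, str]) -> str:
    """Summarize the device choice implied by TORCH_DEVICE (hypothetical helper)."""
    device = settings.get("TORCH_DEVICE", "")
    if not device:
        # Empty or missing value: the README says upstream falls back to auto placement.
        return 'device_map="auto"'
    return f"device={device}"


if __name__ == "__main__":
    sample = "TORCH_DEVICE=cuda:0\nTORCH_ATTN=flash_attention_2\n"
    print(describe_device(parse_env(sample)))
```

In practice the text would come from reading upstream/local.env rather than from an inline sample string.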

Usage (local HF)

cd services/chandra
./run-chandra-hf.sh /chemin/document.pdf ./sortie_ocr
# input directory:
./run-chandra-hf.sh /chemin/documents ./sortie_ocr

Equivalent: ./run-chandra.sh … --method hf.

Usage (vLLM, optional)

If you prefer a vLLM server over loading the model locally:

cd services/chandra/upstream
uv sync
# then start chandra_vllm as described in the upstream README
cd ..
./run-chandra.sh input.pdf ./output --method vllm
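When the CLI is driven from a Python pipeline rather than typed by hand, it can help to build the argument list in one place. The following is a minimal sketch that mirrors the command shape shown above (input, output, then --method); the function name chandra_command and the ALLOWED_METHODS tuple are hypothetical, and only the two methods named in this README are accepted.

```python
from __future__ import annotations

# Methods mentioned in this README; anything else is rejected early.
ALLOWED_METHODS = ("hf", "vllm")


def chandra_command(
    input_path: str,
    output_dir: str,
    method: str = "hf",
    script: str = "./run-chandra.sh",
) -> list[str]:
    """Build the argv list for the wrapper script (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    return [script, input_path, output_dir, "--method", method]


if __name__ == "__main__":
    print(chandra_command("input.pdf", "./output", "vllm"))
```

The resulting list can be passed to subprocess.run(cmd, cwd="services/chandra", check=True); passing a list rather than a shell string avoids quoting problems with paths that contain spaces.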

Role in smart_ide

Structured OCR / digitization for document pipelines, upstream of PageIndex or AnythingLLM / docv.
No HTTP service in this repository: CLI execution only (like services/pageindex/).

IDE API: OCR through smart-ide-tools-bridge (POST /v1/chandra/ocr); see docs/API/smart-ide-tools-bridge-api.md.
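As a rough illustration of calling the bridge endpoint, the sketch below prepares a POST request to /v1/chandra/ocr with the standard library, without sending it. Port 37147 is the tools bridge port given in the commit description; the 127.0.0.1 host, the JSON body, and the function name build_ocr_request are assumptions, and the actual request schema is the one documented in docs/API/smart-ide-tools-bridge-api.md.

```python
from __future__ import annotations

import json
import urllib.request


def build_ocr_request(
    payload: dict,
    base_url: str = "http://127.0.0.1:37147",  # assumed host; port from the commit description
) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to the Chandra OCR bridge route."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chandra/ocr",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # The payload key below is a placeholder, not the documented schema.
    req = build_ocr_request({"input": "document.pdf"})
    print(req.get_method(), req.full_url)
```

Sending it would be a matter of urllib.request.urlopen(req) once the bridge is running and the payload matches the documented fields.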

Documentation: docs/repo/service-chandra.md, docs/features/chandra-ocr-documents.md.

Upstream resources

Repository: datalab-to/chandra
PyPI package: chandra-ocr (alternative: pip install chandra-ocr[hf])
