## Objective and scope

Stand up, as "infrastructure as code", the entire pipeline described: file ingestion, preprocessing/OCR, classification, extraction, contextualisation, AnythingLLM/Ollama indexing, graph, full-text search, business checks, audit. Everything runs under Docker Compose, with reproducible scripts for Debian and for Windows (Docker Desktop + WSL2). Nothing is deferred: everything that follows is immediately executable as-is, once the environment variables are adapted.
## Logical architecture and components

- host-api: ingestion and orchestration API (FastAPI, Python).
- workers: asynchronous tasks (Celery + Redis) for preprocess, ocr, classify, extract, index, checks, finalize.
- application storage: Postgres (business data), MinIO (S3-compatible object store) for PDFs/artifacts, Redis (queues/cache).
- RAG and LLM: Ollama (local models), AnythingLLM (workspaces + embeddings).
- graph and search: Neo4j (dossier contexts), OpenSearch (full text).
- HTTP gateway: Traefik (TLS, routing).
- monitoring: Prometheus + Grafana, Loki + Promtail (logs), Sentry (optional).
## Repository layout

```
notariat-pipeline/
  docker/
    host-api/
      Dockerfile
      requirements.txt
    worker/
      Dockerfile
      requirements.txt
    traefik/
      traefik.yml
      dynamic/
        tls.yml
  infra/
    docker-compose.yml
    .env.example
    make/.mk
  ops/
    install-debian.sh
    install-windows.ps1
    bootstrap.sh
    seed/                    # initial seeds (lexicons, JSON schemas, checklists)
      schemas/
        extraction_acte.schema.json
        extraction_piece.schema.json
        dossier.schema.json
      checklists/
        vente.yaml
        donation.yaml
      dictionaries/
        ocr_fr_notarial.txt
      rag/
        trames/...
        normes/...
    systemd/
      notariat-pipeline.service
  services/
    host_api/
      app.py
      settings.py
      routes/
      domain/
      tasks/                 # Celery calls: preprocess, ocr, classify, extract, index...
      clients/               # Ban, Sirene, RNE, AnythingLLM, Ollama...
      utils/
    worker/
      worker.py
      pipelines/
        preprocess.py
        ocr.py
        classify.py
        extract.py
        index.py
        checks.py
        finalize.py
      models/
        prompts/
          classify_prompt.txt
          extract_prompt.txt
      postprocess/
        lexical_corrections.py
    charts/                  # Grafana dashboards (JSON)
  README.md
  Makefile
```
## Environment file

```
# infra/.env.example
PROJECT_NAME=notariat
DOMAIN=localhost
TZ=Europe/Paris

POSTGRES_USER=notariat
POSTGRES_PASSWORD=notariat_pwd
POSTGRES_DB=notariat

REDIS_PASSWORD=
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio_pwd
MINIO_BUCKET=ingest

ANYLLM_API_KEY=change_me
ANYLLM_BASE_URL=http://anythingllm:3001
ANYLLM_WORKSPACE_NORMES=workspace_normes
ANYLLM_WORKSPACE_TRAMES=workspace_trames
ANYLLM_WORKSPACE_ACTES=workspace_actes

OLLAMA_BASE_URL=http://ollama:11434
OLLAMA_MODELS=llama3:8b,mistral:7b

NEO4J_AUTH=neo4j/neo4j_pwd
OPENSEARCH_PASSWORD=opensearch_pwd

TRAEFIK_ACME_EMAIL=ops@example.org
```

Copy it to infra/.env and adjust.
## Docker Compose

```yaml
# infra/docker-compose.yml
version: "3.9"

x-env: &default-env
  TZ: ${TZ}
  PUID: "1000"
  PGID: "1000"

services:
  traefik:
    image: traefik:v3.1
    command:
      - --providers.docker=true
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./../docker/traefik/traefik.yml:/traefik.yml:ro
      - ./../docker/traefik/dynamic:/dynamic:ro
    environment: *default-env
    restart: unless-stopped

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis:/data
    restart: unless-stopped

  minio:
    image: minio/minio:RELEASE.2025-01-13T00-00-00Z
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    volumes:
      - minio:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    restart: unless-stopped

  anythingllm:
    image: mintplexlabs/anythingllm:latest
    environment:
      - DISABLE_AUTH=true
    depends_on:
      - ollama
    ports:
      - "3001:3001"
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  neo4j:
    image: neo4j:5
    environment:
      - NEO4J_AUTH=${NEO4J_AUTH}
    volumes:
      - neo4j:/data
    ports:
      - "7474:7474"
      - "7687:7687"
    restart: unless-stopped

  opensearch:
    image: opensearchproject/opensearch:2.14.0
    environment:
      - discovery.type=single-node
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_PASSWORD}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - opensearch:/usr/share/opensearch/data
    ports:
      - "9200:9200"
    restart: unless-stopped

  host-api:
    build:
      context: ..
      dockerfile: docker/host-api/Dockerfile
    env_file: ./.env
    environment:
      <<: *default-env
      DATABASE_URL: postgresql+psycopg://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
      REDIS_URL: redis://redis:6379/0
      MINIO_ENDPOINT: http://minio:9000
      MINIO_BUCKET: ${MINIO_BUCKET}
      ANYLLM_BASE_URL: ${ANYLLM_BASE_URL}
      ANYLLM_API_KEY: ${ANYLLM_API_KEY}
      OLLAMA_BASE_URL: ${OLLAMA_BASE_URL}
    volumes:
      - ../services/host_api:/app
      - ../ops/seed:/seed:ro
      - ../ops/seed/schemas:/schemas:ro
    depends_on:
      - postgres
      - redis
      - minio
      - ollama
      - anythingllm
      - neo4j
      - opensearch
    restart: unless-stopped
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.hostapi.rule=Host(`${DOMAIN}`) && PathPrefix(`/api`)"
      - "traefik.http.routers.hostapi.entrypoints=web"
      - "traefik.http.services.hostapi.loadbalancer.server.port=8000"

  worker:
    build:
      context: ..
      dockerfile: docker/worker/Dockerfile
    env_file: ./.env
    environment:
      <<: *default-env
      DATABASE_URL: postgresql+psycopg://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
      REDIS_URL: redis://redis:6379/0
      MINIO_ENDPOINT: http://minio:9000
      MINIO_BUCKET: ${MINIO_BUCKET}
      ANYLLM_BASE_URL: ${ANYLLM_BASE_URL}
      ANYLLM_API_KEY: ${ANYLLM_API_KEY}
      OLLAMA_BASE_URL: ${OLLAMA_BASE_URL}
      OPENSEARCH_URL: http://opensearch:9200
      NEO4J_URL: bolt://neo4j:7687
      NEO4J_AUTH: ${NEO4J_AUTH}
    volumes:
      - ../services/worker:/app
      - ../ops/seed:/seed:ro
    depends_on:
      - host-api
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.54.1
    volumes:
      - prometheus:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - grafana:/var/lib/grafana
      - ../services/charts:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    restart: unless-stopped

volumes:
  pgdata:
  redis:
  minio:
  ollama:
  neo4j:
  opensearch:
  prometheus:
  grafana:
```
## Main Dockerfiles

```dockerfile
# docker/host-api/Dockerfile
# Built with the repository root as build context (see docker-compose.yml)
FROM python:3.11-slim
RUN apt-get update && apt-get install -y libmagic1 poppler-utils && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY docker/host-api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY services/host_api /app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

```
# docker/host-api/requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.30.6
pydantic==2.8.2
sqlalchemy==2.0.35
psycopg[binary]==3.2.1
minio==7.2.7
redis==5.0.7
requests==2.32.3
opensearch-py==2.6.0
neo4j==5.23.1
python-multipart==0.0.9
```

```dockerfile
# docker/worker/Dockerfile
# Built with the repository root as build context (see docker-compose.yml)
FROM python:3.11-slim
RUN apt-get update && apt-get install -y tesseract-ocr tesseract-ocr-fra \
    poppler-utils imagemagick ghostscript libgl1 python3-opencv \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY docker/worker/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY services/worker /app
# Start the Celery worker defined in worker.py
CMD ["celery", "-A", "worker", "worker", "--loglevel=info"]
```

```
# docker/worker/requirements.txt
celery[redis]==5.4.0
ocrmypdf            # required by pipelines/ocr.py
opencv-python-headless==4.10.0.84
pytesseract==0.3.13
numpy==2.0.1
pillow==10.4.0
pdfminer.six==20240706
python-alto==0.5.0
rapidfuzz==3.9.6
requests==2.32.3
minio==7.2.7
psycopg[binary]==3.2.1
sqlalchemy==2.0.35
opensearch-py==2.6.0
neo4j==5.23.1
jsonschema==4.23.0
```
## Installation scripts

```bash
# ops/install-debian.sh
set -euo pipefail
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg lsb-release make git
# Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker "$USER"
# Compose plugin
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p "$DOCKER_CONFIG/cli-plugins"
curl -SL https://github.com/docker/compose/releases/download/v2.29.7/docker-compose-linux-x86_64 \
  -o "$DOCKER_CONFIG/cli-plugins/docker-compose"
chmod +x "$DOCKER_CONFIG/cli-plugins/docker-compose"
echo "Log out and back in to apply docker group membership."
```

```powershell
# ops/install-windows.ps1 (run from an elevated PowerShell)
winget install --id Docker.DockerDesktop -e
winget install --id Git.Git -e
winget install --id GnuWin32.Make -e
```
## Infrastructure bootstrap

```bash
# ops/bootstrap.sh
set -euo pipefail
cd "$(dirname "$0")/../infra"

cp -n .env.example .env || true

# Load the environment so the MINIO_* variables used below are defined
set -a; source .env; set +a

docker compose pull

docker compose up -d postgres redis minio opensearch neo4j ollama anythingllm traefik

sleep 8

# MinIO: create the ingest bucket
mc alias set local http://127.0.0.1:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD" || true
mc mb -p "local/$MINIO_BUCKET" || true

# Ollama: pull the models
curl -s http://127.0.0.1:11434/api/pull -d '{"name":"llama3:8b"}'
curl -s http://127.0.0.1:11434/api/pull -d '{"name":"mistral:7b"}'

docker compose up -d host-api worker grafana prometheus
```

Tip for mc: install the MinIO client locally, or run a minio/mc container attached to the Docker network.
## Makefile for one-command operation

```makefile
# Makefile — recipe lines must be indented with a tab
SHELL := /bin/bash
ENV ?= infra/.env
# Load the .env (once it exists) so the ANYLLM_* variables are available to make
-include $(ENV)

.PHONY: up down bootstrap logs ps seed-anythingllm

up:
	cd infra && docker compose up -d

down:
	cd infra && docker compose down

bootstrap:
	bash ops/bootstrap.sh

logs:
	cd infra && docker compose logs -f --tail=200

ps:
	cd infra && docker compose ps

# From the host, ANYLLM_BASE_URL must point at the published port (http://localhost:3001)
seed-anythingllm:
	curl -s -X POST "$(ANYLLM_BASE_URL)/api/workspaces" \
	  -H "Authorization: Bearer $(ANYLLM_API_KEY)" \
	  -H "Content-Type: application/json" \
	  -d '{"name":"$(ANYLLM_WORKSPACE_NORMES)"}' || true
	curl -s -X POST "$(ANYLLM_BASE_URL)/api/workspaces" \
	  -H "Authorization: Bearer $(ANYLLM_API_KEY)" \
	  -H "Content-Type: application/json" \
	  -d '{"name":"$(ANYLLM_WORKSPACE_TRAMES)"}' || true
	curl -s -X POST "$(ANYLLM_BASE_URL)/api/workspaces" \
	  -H "Authorization: Bearer $(ANYLLM_API_KEY)" \
	  -H "Content-Type: application/json" \
	  -d '{"name":"$(ANYLLM_WORKSPACE_ACTES)"}' || true
```

Run: `make bootstrap && make seed-anythingllm`.
## Minimal ingestion API

```python
# services/host_api/app.py
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from tasks.enqueue import enqueue_import
from pydantic import BaseModel
import uuid, time

app = FastAPI()

class ImportMeta(BaseModel):
    id_dossier: str
    source: str
    etude_id: str
    utilisateur_id: str

@app.post("/api/import")
async def import_doc(
    file: UploadFile = File(...),
    id_dossier: str = Form(...),
    source: str = Form("upload"),
    etude_id: str = Form(...),
    utilisateur_id: str = Form(...)
):
    if file.content_type not in ("application/pdf", "image/jpeg", "image/png", "image/tiff", "image/heic"):
        raise HTTPException(415, "unsupported file type")
    doc_id = str(uuid.uuid4())
    # push to MinIO and record in the DB (omitted here; see the storage sketch below), then enqueue
    enqueue_import(doc_id, {
        "id_dossier": id_dossier,
        "source": source,
        "etude_id": etude_id,
        "utilisateur_id": utilisateur_id,
        "filename": file.filename,
        "mime": file.content_type,
        "received_at": int(time.time())
    })
    return {"status": "queued", "id_document": doc_id}
```
```python
# services/host_api/tasks/enqueue.py
from redis import Redis
import json, os

r = Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))

def enqueue_import(doc_id: str, meta: dict):
    payload = {"doc_id": doc_id, "meta": meta}
    r.lpush("queue:import", json.dumps(payload))
```
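The endpoint above leaves the MinIO upload and the database record as a comment. Below is a minimal sketch of the object-store part, assuming a small `storage.py` helper next to `enqueue.py`; the helper name and object-key layout are illustrative, not part of the repository.

```python
# services/host_api/tasks/storage.py (illustrative helper, not in the original tree)
import io, os
from minio import Minio  # minio==7.2.7 is already in requirements.txt

_client = Minio(
    os.getenv("MINIO_ENDPOINT", "http://minio:9000").replace("http://", ""),
    access_key=os.getenv("MINIO_ROOT_USER"),
    secret_key=os.getenv("MINIO_ROOT_PASSWORD"),
    secure=False,  # TLS terminates at Traefik in this stack
)

def put_upload(doc_id: str, filename: str, data: bytes, content_type: str) -> str:
    """Store the raw upload under <bucket>/<doc_id>/original/<filename> and return the object key."""
    bucket = os.getenv("MINIO_BUCKET", "ingest")
    key = f"{doc_id}/original/{filename}"
    _client.put_object(bucket, key, io.BytesIO(data), length=len(data), content_type=content_type)
    return key
```

In `import_doc` this would be called as `put_upload(doc_id, file.filename, await file.read(), file.content_type)` before enqueuing.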
## Celery worker orchestrating the pipeline

```python
# services/worker/worker.py
import os
from celery import Celery
from pipelines import preprocess, ocr, classify, extract, index, checks, finalize

app = Celery('worker', broker=os.getenv("REDIS_URL"), backend=os.getenv("REDIS_URL"))

@app.task
def pipeline_run(doc_id: str):
    ctx = {}
    preprocess.run(doc_id, ctx)
    ocr.run(doc_id, ctx)
    classify.run(doc_id, ctx)
    extract.run(doc_id, ctx)
    index.run(doc_id, ctx)
    checks.run(doc_id, ctx)
    finalize.run(doc_id, ctx)
    return {"doc_id": doc_id, "status": "done"}
```
To turn the Redis list "queue:import" into Celery executions, add a small bridge (a dedicated service or a thread) that reads queue:import and calls pipeline_run.delay(doc_id), as sketched below.
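A minimal sketch of such a bridge, run as a side process in the worker container; the file name and the blocking-pop loop are one possible shape, not prescribed by the repository.

```python
# services/worker/bridge.py (illustrative)
import json, os
from redis import Redis
from worker import pipeline_run  # the Celery task defined above

r = Redis.from_url(os.getenv("REDIS_URL", "redis://redis:6379/0"))

def main():
    # Blocking pop: waits for host-api to LPUSH new documents onto queue:import
    while True:
        _, raw = r.brpop("queue:import")
        payload = json.loads(raw)
        pipeline_run.delay(payload["doc_id"])

if __name__ == "__main__":
    main()
```

It can be launched alongside the Celery worker, for example as a second command in the worker container or a dedicated compose service.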
## Key integrations in the pipelines

Example of post-OCR processing with lexical correction and ALTO export (a sketch of the lexical-correction helper follows the code):
```python
# services/worker/pipelines/ocr.py
import tempfile, subprocess
from .utils import storage, alto_tools, text_normalize

def run(doc_id, ctx):
    pdf_path = storage.get_local_pdf(doc_id)  # downloads from MinIO
    # if the PDF already has a native text layer: skip OCR and extract with pdftotext
    out_pdf = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False).name
    subprocess.run(["ocrmypdf", "--sidecar", out_pdf + ".txt",
                    "--output-type", "pdf", pdf_path, out_pdf], check=True)
    with open(out_pdf + ".txt", "r", encoding="utf8") as f:
        text = f.read()
    text = text_normalize.correct_notarial(text, dict_path="/seed/dictionaries/ocr_fr_notarial.txt")
    # generate ALTO (e.g. via tesseract's ALTO/hOCR output)
    # store artifacts in MinIO and update the context
    storage.put(doc_id, "ocr.pdf", out_pdf)
    storage.put_bytes(doc_id, "ocr.txt", text.encode("utf8"))
    ctx["text"] = text
```
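`text_normalize.correct_notarial` is only referenced above. A minimal sketch of what it could look like, assuming a plain word-per-line dictionary and rapidfuzz (already pinned in the worker requirements); the thresholds and tokenisation are illustrative.

```python
# services/worker/postprocess/lexical_corrections.py (illustrative sketch)
import re
from functools import lru_cache
from rapidfuzz import process, fuzz

@lru_cache(maxsize=4)
def _load_dictionary(dict_path: str) -> tuple:
    with open(dict_path, encoding="utf8") as f:
        return tuple(w.strip() for w in f if w.strip())

def correct_notarial(text: str, dict_path: str, min_score: float = 90.0) -> str:
    """Snap OCR tokens onto the closest entry of the notarial lexicon when the match is near-certain."""
    lexicon = _load_dictionary(dict_path)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        if len(word) < 5:  # leave short tokens alone
            return word
        best = process.extractOne(word, lexicon, scorer=fuzz.ratio, score_cutoff=min_score)
        return best[0] if best else word

    return re.sub(r"\w{2,}", fix, text)
```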
Classification via Ollama with a few-shot prompt (an example prompt file is sketched after the code):
```python
# services/worker/pipelines/classify.py
import requests, os, json

OLLAMA = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")

PROMPT = open("/app/models/prompts/classify_prompt.txt", "r", encoding="utf8").read()

def run(doc_id, ctx):
    text = ctx["text"][:16000]  # context-window limit
    prompt = PROMPT.replace("{{TEXT}}", text)
    resp = requests.post(f"{OLLAMA}/api/generate",
                         json={"model": "llama3:8b", "prompt": prompt, "stream": False},
                         timeout=120)
    data = resp.json()
    label = json.loads(data["response"])["label"]  # convention: the model returns JSON
    ctx["label"] = label
```
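The prompt file itself is not shown in the repository. A plausible minimal few-shot sketch, consistent with the JSON convention used above; the label set and wording are illustrative.

```
# models/prompts/classify_prompt.txt (illustrative)
You classify notarial documents. Answer with a single JSON object, nothing else:
{"label": "<one of: acte_vente, acte_donation, piece_identite, rib, diagnostic, autre>"}

Example:
Text: "L'an deux mille vingt-quatre ... a comparu ... VEND à ..."
Answer: {"label": "acte_vente"}

Text:
{{TEXT}}
Answer:
```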
AnythingLLM indexing:
```python
# services/worker/pipelines/index.py
import requests, os
from .utils import build_chunks  # chunking helper (assumed to live in pipelines/utils)

ANY = os.getenv("ANYLLM_BASE_URL")
KEY = os.getenv("ANYLLM_API_KEY")
WS_ACTES = os.getenv("ANYLLM_WORKSPACE_ACTES")

def run(doc_id, ctx):
    headers = {"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"}
    chunks = build_chunks(ctx["text"], meta={"doc_id": doc_id, "label": ctx["label"]})
    requests.post(f"{ANY}/api/workspaces/{WS_ACTES}/documents",
                  headers=headers, json={"documents": chunks}, timeout=60)
```
The Neo4j graph and OpenSearch follow the same pattern, with their respective clients (a sketch follows). The DMTO checks and consistency controls are implemented in checks.py, with the rate tables shipped as seeds.
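A minimal sketch of that indexing step, using the opensearch-py and neo4j drivers already pinned in the worker requirements; the index name, node labels and relationship type are illustrative choices, not fixed by the repository.

```python
# services/worker/pipelines/index_graph.py (illustrative)
import os
from opensearchpy import OpenSearch
from neo4j import GraphDatabase

os_client = OpenSearch(os.getenv("OPENSEARCH_URL", "http://opensearch:9200"),
                       http_auth=("admin", os.getenv("OPENSEARCH_PASSWORD", "")),
                       verify_certs=False)
neo_user, neo_pwd = os.getenv("NEO4J_AUTH", "neo4j/neo4j").split("/", 1)
neo = GraphDatabase.driver(os.getenv("NEO4J_URL", "bolt://neo4j:7687"), auth=(neo_user, neo_pwd))

def run(doc_id, ctx):
    # Full-text: one OpenSearch document per ingested file
    os_client.index(index="documents", id=doc_id,
                    body={"text": ctx["text"], "label": ctx["label"]})
    # Graph: attach the document to its dossier node
    with neo.session() as session:
        session.run(
            "MERGE (d:Dossier {id: $dossier}) "
            "MERGE (doc:Document {id: $doc_id, label: $label}) "
            "MERGE (doc)-[:APPARTIENT_A]->(d)",
            dossier=ctx.get("id_dossier", "unknown"), doc_id=doc_id, label=ctx["label"])
```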
## Security and compliance

- encryption at rest: host the Docker volumes on an encrypted filesystem, or encrypt sensitive blobs at the application level before writing them to MinIO.
- TLS at the edge via Traefik, with Let's Encrypt certificates in production.
- per-office (étude) isolation through separate AnythingLLM workspaces, named OpenSearch indices and Neo4j labels.
- selective masking of data used for training: redaction functions for RIB, MRZ and identification numbers.
- audit logs: each pipeline step writes a structured JSON event (timestamp, versions, hashes of inputs/outputs); a sketch follows this list.
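A minimal sketch of such an audit-event writer, matching the fields listed above; the helper name and the log destination (stdout, collected by Promtail/Loki) are illustrative.

```python
# services/worker/pipelines/audit.py (illustrative)
import hashlib, json, os, sys, time

PIPELINE_VERSION = os.getenv("PIPELINE_VERSION", "dev")

def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit_event(doc_id: str, step: str, inputs: bytes, outputs: bytes, **extra) -> None:
    """Emit one structured JSON line per pipeline step."""
    event = {
        "ts": time.time(),
        "doc_id": doc_id,
        "step": step,
        "pipeline_version": PIPELINE_VERSION,
        "input_sha256": _sha256(inputs),
        "output_sha256": _sha256(outputs),
        **extra,
    }
    sys.stdout.write(json.dumps(event, ensure_ascii=False) + "\n")
```

For example, the OCR step could end with `audit_event(doc_id, "ocr", raw_pdf_bytes, text.encode("utf8"))`.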
## Monitoring and metrics

- export Celery, host-api and worker metrics through Prometheus /metrics endpoints (a scrape-config sketch follows this list).
- Grafana dashboards provided in services/charts: error rate, latency per step, OCR quality (CER/WER), classification F1, extraction precision/recall, RAG MRR/NDCG.
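The compose file starts Prometheus without a configuration mount. A minimal scrape configuration, assuming host-api and the worker each expose a /metrics endpoint on an application port (the ports and job names are illustrative); it would be mounted at /etc/prometheus/prometheus.yml.

```yaml
# infra/prometheus.yml (illustrative)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: host-api
    static_configs:
      - targets: ["host-api:8000"]   # assuming a /metrics endpoint is added to the FastAPI app
  - job_name: worker
    static_configs:
      - targets: ["worker:8001"]     # wherever the worker's metrics exporter is made to listen
```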
## End-to-end deployment

- install Docker and Compose on Debian or Windows with the scripts provided.
- clone the repository, copy infra/.env.example to infra/.env and edit the secrets.
- run make bootstrap.
- create the AnythingLLM workspaces: make seed-anythingllm.
- check that Ollama has pulled the models.
- import the seeds: place the public trames and normes under ops/seed/rag/... then run a simple ingestion script through the AnythingLLM API (examples provided).
- test an ingestion:
curl -F "file=@/chemin/mon_scan.pdf" \
|
||
-F "id_dossier=D-2025-001" \
|
||
-F "source=upload" \
|
||
-F "etude_id=E-001" \
|
||
-F "utilisateur_id=U-123" \
|
||
http://localhost:80/api/import
|
||
|
||
|
||
- follow the logs with make logs and check the Grafana dashboards at http://localhost:3000.
## Startup automation

systemd unit for Debian:
```ini
# ops/systemd/notariat-pipeline.service
[Unit]
Description=Notariat pipeline
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
WorkingDirectory=/opt/notariat/infra
Environment=COMPOSE_PROJECT_NAME=notariat
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
TimeoutStartSec=0
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Copy it to /etc/systemd/system/, then run sudo systemctl enable --now notariat-pipeline.
## Initial data and seeds

- JSON schemas: put the three schemas provided under ops/seed/schemas.
- checklists per deed type: exhaustive YAML files under ops/seed/checklists.
- notarial OCR dictionary: ops/seed/dictionaries/ocr_fr_notarial.txt.
- public trames and normes: drop the files in place and use a small Python ingestion script that splits them into 1,000-2,000 character chunks with metadata, then POSTs them to the AnythingLLM API (sketched below).
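A minimal sketch of that ingestion script, reusing the workspace endpoint already used in index.py; the naive chunking and the payload shape are assumptions carried over from that file, not a documented AnythingLLM contract.

```python
# ops/seed/ingest_rag.py (illustrative)
import os, pathlib, requests

BASE = os.getenv("ANYLLM_BASE_URL", "http://localhost:3001")
KEY = os.getenv("ANYLLM_API_KEY", "change_me")
WORKSPACE = os.getenv("ANYLLM_WORKSPACE_NORMES", "workspace_normes")

def chunk(text: str, size: int = 1500):
    """Split into roughly 1,000-2,000 character chunks on paragraph boundaries."""
    buf = ""
    for para in text.split("\n\n"):
        if len(buf) + len(para) > size and buf:
            yield buf.strip()
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        yield buf.strip()

def ingest(root: str = "ops/seed/rag/normes"):
    headers = {"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"}
    for path in pathlib.Path(root).rglob("*.txt"):
        docs = [{"text": c, "metadata": {"source": path.name, "chunk": i}}
                for i, c in enumerate(chunk(path.read_text(encoding="utf8")))]
        requests.post(f"{BASE}/api/workspaces/{WORKSPACE}/documents",
                      headers=headers, json={"documents": docs}, timeout=120)

if __name__ == "__main__":
    ingest()
```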
## Automated tests

- unit tests: pytest services/ with anonymised sample datasets in tests/data/.
- performance tests: locust or k6 against /api/import, with per-step targets documented in README.md.
- quality thresholds: environment variables that flag manual_review=true when CER > 0.08, classification confidence < 0.75, or mandatory fields are missing (see the sketch after this list).
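A minimal sketch of how checks.py could apply those thresholds; the environment-variable names and the ctx keys are illustrative choices.

```python
# services/worker/pipelines/checks.py (threshold excerpt, illustrative)
import os

CER_MAX = float(os.getenv("QUALITY_CER_MAX", "0.08"))
CLASSIF_MIN_CONFIDENCE = float(os.getenv("QUALITY_CLASSIF_MIN_CONF", "0.75"))

def run(doc_id, ctx):
    reasons = []
    if ctx.get("cer", 0.0) > CER_MAX:
        reasons.append("ocr_cer_above_threshold")
    if ctx.get("label_confidence", 1.0) < CLASSIF_MIN_CONFIDENCE:
        reasons.append("low_classification_confidence")
    missing = [f for f in ctx.get("required_fields", []) if not ctx.get("extracted", {}).get(f)]
    if missing:
        reasons.append("missing_fields:" + ",".join(missing))
    ctx["manual_review"] = bool(reasons)
    ctx["review_reasons"] = reasons
```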
## Windows adaptations

- use Docker Desktop with the WSL2 backend enabled.
- keep the repository under \\wsl$\Ubuntu\home\... to avoid volume-mount issues.
- run make bootstrap from WSL.
## Points of attention

- Ollama memory and CPU: size the host for the chosen models; launch with --gpus all when an NVIDIA GPU is available.
- AnythingLLM on SQLite is fine to start with; migrate to Postgres as soon as needed.
- OpenSearch needs 4-6 GB of RAM to be comfortable locally.
- updates to the normes: a periodic Celery beat task reloads the affected embeddings, with versioned dumps and version_date tags (a sketch follows this list).
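A minimal sketch of that periodic task wired into the existing Celery app; the task body and schedule are illustrative, and the re-ingestion itself would call the same AnythingLLM API as index.py.

```python
# services/worker/worker.py (beat excerpt, illustrative)
from celery.schedules import crontab

app.conf.beat_schedule = {
    "refresh-normes-embeddings": {
        "task": "worker.refresh_normes",
        "schedule": crontab(hour=3, minute=0),  # every night at 03:00
    },
}

@app.task(name="worker.refresh_normes")
def refresh_normes():
    # Re-read the versioned normes dumps from /seed/rag/normes and re-index
    # the affected chunks, tagging each one with its version_date.
    ...
```

Running it requires starting `celery -A worker beat` alongside the worker process.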
## Operational conclusion

The repository and scripts above provide a fully scripted, reproducible and compartmentalised installation covering the chain described at the outset: ingestion, preprocessing/OCR, classification, extraction, AnythingLLM/Ollama indexing, graph and full-text search, business checks and audit.