Evolve summarizer into AI service with OCR support

2026-03-23 20:12:34 +01:00
parent 90fdd8e1a5
commit 653f713a78
20 changed files with 475 additions and 129 deletions
@@ -1,16 +1,22 @@
-# Local Hugging Face Summarizer
+# Local AI Service

-This small service runs a Hugging Face summarization model locally and exposes a simple HTTP API.
+This service runs a Hugging Face summarization model locally and exposes document text extraction, with OCR for scanned PDFs and supported images.

-Install (recommended: virtualenv)
+## Capabilities
+- job/role summarization
+- PDF text extraction
+- OCR fallback for scanned PDFs
+- OCR for image uploads (`png`, `jpg`, `jpeg`, `webp`)
+- DOCX / TXT / MD extraction

-Windows (CPU PyTorch wheel may be required):
+## Install
+
+Windows:

 ```powershell
 python -m venv .venv
 .\.venv\Scripts\Activate.ps1
 pip install -r requirements.txt
-# If torch wheel installation is needed, follow instructions at https://pytorch.org
 python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
 ```

@@ -23,10 +29,15 @@ pip install -r requirements.txt
 python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
 ```

-API
- `GET /health` — health check
- `POST /summarize` — JSON body `{ "text": "...", "max_length": 150, "min_length": 30 }` returns `{ "summary": "...", "cached": false }`
+## Docker
+The Dockerfile installs Tesseract OCR so scanned PDFs and supported images can be processed inside the container.

-Notes
- Model will be downloaded on first run and can be several hundred MB.
- For lower memory usage, consider `sshleifer/tiny-distilbart-cnn-6-6` or `t5-small`.
+## API
+- `GET /health` — health check and runtime capabilities
+- `POST /summarize` — JSON body `{ "text": "...", "max_length": 150, "min_length": 30 }`
+- `POST /extract-text` — multipart file upload, returns extracted text and OCR metadata
+
+## Notes
+- Model weights are downloaded on first run.
+- OCR quality depends on scan quality and language support.
+- Default OCR language is English (`eng`).
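
For reviewers who want to exercise the `POST /summarize` endpoint listed in the API section, here is a minimal client sketch using only the Python standard library. It assumes the service is reachable at `http://127.0.0.1:8001`, as in the `uvicorn` command shown in the README. The helper names `build_summarize_request` and `send_json` are hypothetical and not part of this commit. The response shape is not stated in the updated README, so the sketch only decodes whatever JSON comes back.

```python
import json
import urllib.request

# Assumption: host and port match the uvicorn command in the README.
BASE_URL = "http://127.0.0.1:8001"


def build_summarize_request(text, max_length=150, min_length=30, base_url=BASE_URL):
    """Build (but do not send) a POST /summarize request with a JSON body.

    The field names mirror the JSON body documented in the README.
    """
    payload = {"text": text, "max_length": max_length, "min_length": min_length}
    return urllib.request.Request(
        url=f"{base_url}/summarize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=60):
    """Send a prepared request and decode the JSON response body.

    Hypothetical helper: it makes no assumption about the response fields.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send_json(build_summarize_request("Senior backend engineer role ..."))` would return the decoded response. Building the request separately from sending it lets the payload be inspected or unit-tested without a live server.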