Evolve summarizer into AI service with OCR support
+22
-11
@@ -1,16 +1,22 @@
# Local Hugging Face Summarizer
# Local AI Service

This small service runs a Hugging Face summarization model locally and exposes a simple HTTP API.
This service runs a local Hugging Face summarization model and also exposes document text extraction with OCR for supported PDFs and images.

Install (recommended: virtualenv)
## Capabilities
- job/role summarization
- PDF text extraction
- OCR fallback for scanned PDFs
- OCR for image uploads (`png`, `jpg`, `jpeg`, `webp`)
- DOCX / TXT / MD extraction
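
The capability list above implies a routing step from file type to extraction method. A minimal sketch of such routing follows, assuming a hypothetical helper `choose_strategy` and strategy labels that do not appear in this commit; the service's real dispatch code may look quite different.

```python
from pathlib import Path

# Hypothetical strategy labels; the service's real names are not shown in this diff.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # image uploads go straight to OCR
PLAIN_EXTS = {".txt", ".md"}                     # plain-text formats are read directly


def choose_strategy(filename: str) -> str:
    """Map an uploaded file name to an extraction strategy label."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Try the embedded text layer first; fall back to OCR for scanned PDFs.
        return "pdf_text_then_ocr"
    if ext in IMAGE_EXTS:
        return "ocr"
    if ext == ".docx":
        return "docx"
    if ext in PLAIN_EXTS:
        return "plain_text"
    raise ValueError(f"unsupported file type: {ext or filename}")
```

Matching on the lowercased suffix keeps uploads such as `SCAN.PDF` on the same path as `scan.pdf`; rejecting unknown extensions early avoids handing arbitrary binaries to the OCR step.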

Windows (CPU PyTorch wheel may be required):
## Install

Windows:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
# If torch wheel installation is needed, follow instructions at https://pytorch.org
python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
```
@@ -23,10 +29,15 @@ pip install -r requirements.txt
python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
```

API
- `GET /health` — health check
- `POST /summarize` — JSON body `{ "text": "...", "max_length": 150, "min_length": 30 }` returns `{ "summary": "...", "cached": false }`
## Docker
The Dockerfile installs Tesseract OCR so scanned PDFs and supported images can be processed inside the container.
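
When running outside the container, Tesseract is presumably needed on the host as well. One way a service could detect it at startup is to look for the `tesseract` binary on `PATH`. The helper name `ocr_capabilities` and the returned keys below are assumptions for illustration, not the service's actual `/health` payload.

```python
import shutil


def ocr_capabilities(binary: str = "tesseract") -> dict:
    """Report whether the OCR binary can be found on PATH (illustrative helper)."""
    path = shutil.which(binary)  # None when the executable is not installed
    return {"ocr_available": path is not None, "tesseract_path": path}
```

A check like this only confirms that the executable exists; it says nothing about which language packs are installed.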

Notes
- Model will be downloaded on first run and can be several hundred MB.
- For lower memory usage, consider `sshleifer/tiny-distilbart-cnn-6-6` or `t5-small`.
## API
- `GET /health` — health check and runtime capabilities
- `POST /summarize` — JSON body `{ "text": "...", "max_length": 150, "min_length": 30 }`
- `POST /extract-text` — multipart file upload, returns extracted text and OCR metadata
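
The endpoints above can be exercised from Python with the standard library alone. The sketch below assumes the service listens on `http://127.0.0.1:8001`, as in the uvicorn command, and that `/extract-text` reads the upload from a multipart field named `file`. That field name is a guess, since the diff does not show it, and the helper names are illustrative.

```python
import json
import uuid
import urllib.request

BASE_URL = "http://127.0.0.1:8001"  # host and port taken from the uvicorn command


def summarize_request(text, max_length=150, min_length=30):
    """Build (but do not send) a POST /summarize request."""
    body = json.dumps(
        {"text": text, "max_length": max_length, "min_length": min_length}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text_request(filename, content, field="file"):
    """Build (but do not send) a multipart POST /extract-text request.

    `field` is an assumed form-field name; check the service code for the real one.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/extract-text",
        data=head + content + tail,  # content must be bytes
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request):
    """Send a prepared request; assumes the service replies with JSON."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `send(summarize_request("some long text"))` should return the decoded JSON reply. Building the request separately from sending it makes the payload easy to inspect without a live server.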

## Notes
- Model weights are downloaded on first run.
- OCR quality depends on scan quality and language support.
- Default OCR language is English (`eng`).