# Local AI Service

This service runs a local Hugging Face summarization model and also exposes document text extraction with OCR for supported PDFs and images.

## Capabilities
- job/role summarization
- PDF text extraction
- OCR fallback for scanned PDFs
- OCR for image uploads (`png`, `jpg`, `jpeg`, `webp`)
- DOCX / TXT / MD extraction
- optional Ollama-backed CV block classification for harder sectioning

## Install

Windows:

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
```

Linux / macOS:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m uvicorn app:app --host 127.0.0.1 --port 8001 --workers 1
```

If the host is missing `python3-venv` or `pip`, use the bootstrap script instead:

```bash
./scripts/bootstrap-and-test.sh bootstrap
```

## Docker
The Dockerfile installs Tesseract OCR so scanned PDFs and supported images can be processed inside the container.

## Tests

Run the summarizer unit tests with:

```bash
./scripts/bootstrap-and-test.sh test
```

The script:
- creates `.venv` with stdlib `venv` when available
- falls back to user-space `virtualenv` when host `venv` support is missing
- installs `requirements-dev.txt`
- writes pytest cache under `tmp/pytest-cache` to avoid stale root-owned `.pytest_cache` directories

## API
- `GET /health` — health check and runtime capabilities, including lazy model state (`model_loaded`, `model_disabled`, `summarize_available`, `model_load_error`) plus Ollama version/model metadata when configured
- `POST /summarize` — JSON body `{ "text": "...", "max_length": 150, "min_length": 30 }`
- `POST /extract-text` — multipart file upload, returns extracted text and OCR metadata
- `POST /cv/classify-block` — JSON body `{ "block": "..." }`, uses Ollama when `OLLAMA_MODEL` is configured
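
As a rough illustration of the two JSON endpoints above, the sketch below builds the requests with Python's standard library. The base URL is taken from the `--host` and `--port` values in the install commands. The helper names `build_json_request` and `post_json` are hypothetical, and the assumption that responses come back as JSON is not confirmed by this README.

```python
import json
import urllib.request

# Assumed base URL, matching the uvicorn command in the Install section.
BASE_URL = "http://127.0.0.1:8001"


def build_json_request(path, payload, base_url=BASE_URL):
    """Build (but do not send) a POST request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=60):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_json_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Request bodies as documented in the API list.
summarize_request = build_json_request(
    "/summarize", {"text": "...", "max_length": 150, "min_length": 30}
)
classify_request = build_json_request("/cv/classify-block", {"block": "..."})
```

Calling `post_json("/summarize", {...})` requires the service to be running. Because the model loads lazily, checking `summarize_available` in the `GET /health` response first may avoid a failed call.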

## Ollama
Set these before starting the service if you want the hybrid CV classifier enabled:

```bash
export OLLAMA_BASE_URL=http://ollama:11434
export OLLAMA_MODEL=qwen2.5:7b
```
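
For orientation, here is a minimal sketch of how a service could read these two variables and treat the classifier as disabled when `OLLAMA_MODEL` is unset. The function name `read_ollama_config` and the fallback base URL are assumptions for illustration; the actual service code may resolve its configuration differently.

```python
import os


def read_ollama_config(env=None):
    """Return (base_url, model), or None when OLLAMA_MODEL is not set."""
    env = os.environ if env is None else env
    model = env.get("OLLAMA_MODEL", "").strip()
    if not model:
        # No model configured: this sketch treats the classifier as disabled.
        return None
    # Assumed fallback; the real default (if any) is not documented here.
    base_url = env.get("OLLAMA_BASE_URL", "http://ollama:11434").rstrip("/")
    return base_url, model
```

Passing a plain dictionary as `env` keeps the function easy to exercise without touching the real process environment.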

Choose the model by setting `OLLAMA_MODEL`, then warm it with the helper script:

```bash
OLLAMA_MODEL=qwen2.5:7b ./scripts/start-ollama-cv.sh
```

Equivalent manual flow:

```bash
docker compose up -d ollama
docker compose exec ollama ollama pull qwen2.5:7b
docker compose up -d ai-service
```

## Notes
- Model weights are downloaded on first pull.
- OCR quality depends on scan quality and language support.
- Default OCR language is English (`eng`).