# Unpaywall + pdftotext — Design
Two improvements to the paper CLI: try free open-access sources before piracy, and try simple PDF text extraction before heavy ML OCR.
## Download Pipeline
Priority chain for fetching the PDF:
1. **Unpaywall** (new) — hit `https://api.unpaywall.org/v2/{doi}?email={email}`, check `best_oa_location.url_for_pdf`, download if present
2. **LibGen** (existing)
3. **Anna's Archive** (existing)

Unpaywall requires an email address but no API key. Read it from the `UNPAYWALL_EMAIL` environment variable; if it is unset, skip the Unpaywall step silently, the same pattern used for `ANNAS_ARCHIVE_KEY`.
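The download chain and the skip-if-unset rule could be sketched roughly as below, using only the standard library. The function names (`unpaywall_url`, `configured_email`, `unpaywall_request`, `fetch_first`) are hypothetical, treating a blank value as unset is an assumption, and the HTTP call plus the `best_oa_location.url_for_pdf` lookup (via `reqwest` and `serde_json`) are deliberately left out.

```rust
use std::env;

/// Build the Unpaywall v2 lookup URL for a DOI.
/// Assumption: values are interpolated as-is; a real client would
/// probably percent-encode the email.
fn unpaywall_url(doi: &str, email: &str) -> String {
    format!("https://api.unpaywall.org/v2/{}?email={}", doi, email)
}

/// Treat an unset or blank value as "not configured" (blank handling is an assumption).
fn configured_email(raw: Option<String>) -> Option<String> {
    raw.map(|s| s.trim().to_string()).filter(|s| !s.is_empty())
}

/// Returns the URL to query, or None when the Unpaywall step should be skipped.
fn unpaywall_request(doi: &str) -> Option<String> {
    let email = configured_email(env::var("UNPAYWALL_EMAIL").ok())?;
    Some(unpaywall_url(doi, &email))
}

/// Try each source in priority order and return the first one that yields bytes.
/// Each source is a (name, fetcher) pair; a fetcher returns None on a miss.
fn fetch_first<'a>(
    sources: &[(&'a str, &dyn Fn(&str) -> Option<Vec<u8>>)],
    doi: &str,
) -> Option<(&'a str, Vec<u8>)> {
    sources
        .iter()
        .find_map(|(name, fetch)| fetch(doi).map(|bytes| (*name, bytes)))
}
```

In this sketch the Unpaywall fetcher would return `None` whenever `unpaywall_request` does, so an unset email simply hands control to LibGen without logging anything.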
## Conversion Pipeline
Priority chain for PDF-to-markdown:
1. **pdftotext** (new) — shell out to `pdftotext -layout <pdf> -`, check output quality
2. **marker-pdf** (existing) — heavy ML OCR fallback
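Shelling out to `pdftotext` might look like the following minimal sketch. `run_pdftotext` is a hypothetical helper name, and decoding stdout lossily is an assumption made so that undecodable bytes show up as replacement characters for a later quality check instead of aborting the step.

```rust
use std::path::Path;
use std::process::Command;

/// Run `pdftotext -layout <pdf> -` and capture what it writes to stdout.
/// Returns None if the binary cannot be spawned (for example, it is not
/// on PATH) or if it exits with a non-zero status.
fn run_pdftotext(pdf: &Path) -> Option<String> {
    let output = Command::new("pdftotext")
        .arg("-layout")
        .arg(pdf)
        .arg("-") // "-" sends the extracted text to stdout
        .output()
        .ok()?; // spawn failure: treat as "pdftotext unavailable"

    if !output.status.success() {
        return None;
    }

    // Lossy decoding keeps going on invalid UTF-8, substituting U+FFFD.
    Some(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A `None` result here covers both "not installed" and "failed on this file", so the caller can fall through to marker in either case.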
### Quality heuristic for pdftotext output
- Length > 500 characters
- More than 80% of characters are printable ASCII or common Unicode (letters, digits, punctuation, whitespace)

If either check fails, or `pdftotext` isn't on PATH, fall back to marker.
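One possible reading of the two checks is sketched below. The names `is_clean` and `looks_usable` are hypothetical, and the definition of an acceptable character (Unicode letters and digits, ASCII punctuation, any whitespace) is an assumption; the design only names the categories, so the exact set may need tuning, for instance to accept typographic quotes.

```rust
/// Characters counted as acceptable: Unicode letters and digits,
/// ASCII punctuation, and whitespace (including form feeds).
/// Assumption: non-ASCII punctuation and symbols are not counted.
fn is_clean(c: char) -> bool {
    c.is_alphanumeric() || c.is_ascii_punctuation() || c.is_whitespace()
}

/// True when the text is longer than 500 characters and more than
/// 80% of its characters are acceptable.
fn looks_usable(text: &str) -> bool {
    let total = text.chars().count();
    if total <= 500 {
        return false;
    }
    let clean = text.chars().filter(|c| is_clean(*c)).count();
    // clean / total > 0.8, written with integers to avoid float rounding
    clean * 5 > total * 4
}
```

Counting characters rather than bytes keeps the length check from being inflated by multi-byte text, and a lossily decoded extraction full of U+FFFD fails the ratio check.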
## Nix changes
Add `poppler_utils` to the flake for `pdftotext`. Include in both the dev shell and the wrapped binary PATH.
## Dependencies
No new Rust crate dependencies — `reqwest` and `serde_json` already handle the Unpaywall API call. `pdftotext` is an external binary like `marker_single`.