Unpaywall + pdftotext — Design
Two improvements to the paper CLI: try free open-access sources before piracy, and try simple PDF text extraction before heavy ML OCR.
Download Pipeline
Priority chain for fetching the PDF:
- Unpaywall (new): request https://api.unpaywall.org/v2/{doi}?email={email}, check best_oa_location.url_for_pdf, download if present
- LibGen (existing)
- Anna's Archive (existing)
Unpaywall requires an email address (no API key). Read it from the UNPAYWALL_EMAIL env var. If unset, skip Unpaywall silently, the same pattern as ANNAS_ARCHIVE_KEY.
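The chain above can be sketched with the standard library alone. This is a minimal illustration, not the CLI's actual code: the names `Fetch`, `fetch_pdf`, `unpaywall_email`, and `unpaywall_url` are hypothetical, each source is modeled as a plain function that returns `None` to fall through, and the HTTP request and JSON parsing (reqwest, serde_json) are left out.

```rust
use std::env;

/// Read the Unpaywall contact email. `None` means the source is skipped silently.
fn unpaywall_email() -> Option<String> {
    env::var("UNPAYWALL_EMAIL").ok()
}

/// Build the lookup URL for a DOI.
/// Assumption: `doi` and `email` contain nothing that needs percent-encoding.
fn unpaywall_url(doi: &str, email: &str) -> String {
    format!("https://api.unpaywall.org/v2/{}?email={}", doi, email)
}

/// Hypothetical source signature: Some(pdf_bytes) on success, None to fall through.
type Fetch = fn(&str) -> Option<Vec<u8>>;

/// Try each source in priority order and return the first hit with its name.
fn fetch_pdf(
    doi: &str,
    sources: &[(&'static str, Fetch)],
) -> Option<(&'static str, Vec<u8>)> {
    sources
        .iter()
        .find_map(|(name, fetch)| fetch(doi).map(|bytes| (*name, bytes)))
}
```

In this shape, an Unpaywall source would return `None` both when `unpaywall_email()` is `None` and when the response carries no PDF URL, so the caller never needs to distinguish "skipped" from "not found".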
Conversion Pipeline
Priority chain for PDF-to-markdown:
- pdftotext (new): shell out to pdftotext -layout <pdf> -, check output quality
- marker-pdf (existing): heavy ML OCR fallback
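A minimal sketch of the shell-out step, using only `std::process::Command`. The function name `extract_text` and its `program` parameter are hypothetical; the parameter exists so the binary name can be swapped out, and the real call would pass `"pdftotext"`. Treating a non-zero exit status as a miss is an assumption, not something the design specifies.

```rust
use std::path::Path;
use std::process::Command;

/// Run `<program> -layout <pdf> -` and capture the text written to stdout.
/// Returns None if the program cannot be spawned (for example, it is not on
/// PATH) or if it exits with a non-zero status.
fn extract_text(program: &str, pdf: &Path) -> Option<String> {
    let output = Command::new(program)
        .arg("-layout")
        .arg(pdf)
        .arg("-") // "-" as the output file sends the text to stdout
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    // Lossy decoding keeps going on invalid UTF-8, replacing bad bytes with U+FFFD.
    Some(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Returning `Option` keeps the caller simple: any `None` falls through to marker-pdf.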
Quality heuristic for pdftotext output
- Length > 500 characters
- > 80% printable ASCII / common Unicode (letters, digits, punctuation, whitespace)
If either check fails, or pdftotext isn't on PATH, fall back to marker.
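The two checks could look roughly like the following. This is a sketch under stated assumptions: the names `looks_usable` and `is_acceptable_char` are hypothetical, length is counted in Unicode scalar values rather than bytes, both thresholds are treated as strict, and "common unicode" is approximated with what the standard library offers (Unicode letters, digits, and whitespace, but ASCII punctuation only).

```rust
/// Thresholds from the heuristic: more than 500 characters, more than 80% acceptable.
const MIN_CHARS: usize = 500;
const MIN_ACCEPTABLE_RATIO: f64 = 0.80;

/// Approximation of "printable ASCII / common unicode".
/// Non-ASCII punctuation such as curly quotes counts as unacceptable here.
fn is_acceptable_char(c: char) -> bool {
    c.is_alphanumeric() || c.is_ascii_punctuation() || c.is_whitespace()
}

/// Return true if pdftotext output passes both checks.
fn looks_usable(text: &str) -> bool {
    let total = text.chars().count();
    if total <= MIN_CHARS {
        return false;
    }
    let acceptable = text.chars().filter(|&c| is_acceptable_char(c)).count();
    acceptable as f64 / total as f64 > MIN_ACCEPTABLE_RATIO
}
```

Output from a scanned PDF with no text layer tends to be empty or very short, which the length check catches. Badly encoded output usually shows up as replacement or control characters, which the ratio check catches.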
Nix changes
Add poppler_utils to the flake for pdftotext. Include in both the dev shell and the wrapped binary PATH.
Dependencies
No new Rust crate dependencies — reqwest and serde_json already handle the Unpaywall API call. pdftotext is an external binary like marker_single.