paper-reader/docs/plans/2026-02-25-unpaywall-and-pdftotext-design.md

Unpaywall + pdftotext — Design

Two improvements to the paper CLI: try free open-access sources before piracy, and try simple PDF text extraction before heavy ML OCR.

Download Pipeline

Priority chain for fetching the PDF:

  1. Unpaywall (new) — hit https://api.unpaywall.org/v2/{doi}?email={email}, check best_oa_location.url_for_pdf, download if present
  2. LibGen (existing)
  3. Anna's Archive (existing)
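The priority chain above amounts to "try each source in order, stop at the first hit". A minimal sketch of that control flow, assuming each source can be modelled as a function from a DOI to an optional path of the downloaded PDF; the `Fetcher` alias and `fetch_first` name are hypothetical and not taken from the codebase:

```rust
use std::path::PathBuf;

/// Hypothetical signature for one download source: `None` means
/// "nothing found here, move on to the next source".
type Fetcher = fn(&str) -> Option<PathBuf>;

/// Walk the sources in priority order and return the first hit,
/// together with the name of the source that produced it.
fn fetch_first<'a>(doi: &str, sources: &[(&'a str, Fetcher)]) -> Option<(&'a str, PathBuf)> {
    sources
        .iter()
        .find_map(|(name, fetch)| fetch(doi).map(|path| (*name, path)))
}
```

With the sources listed as Unpaywall, LibGen, Anna's Archive, a later source is only contacted when every earlier one returned `None`.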

Unpaywall requires an email address but no API key. The address is read from the UNPAYWALL_EMAIL environment variable; if it is unset, the Unpaywall step is skipped silently, following the same pattern as ANNAS_ARCHIVE_KEY.
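A sketch of the "skip when unset" rule, using only the standard library. The function names are hypothetical, treating an empty or whitespace-only value as unset is an assumption, and no percent-encoding is applied to the DOI or email here, which a real implementation would likely want:

```rust
use std::env;

/// Build the Unpaywall lookup URL for a DOI, or `None` when no email is
/// configured. Assumption: a blank value counts as "unset".
fn unpaywall_url(doi: &str, email: Option<&str>) -> Option<String> {
    let email = email.map(str::trim).filter(|e| !e.is_empty())?;
    Some(format!("https://api.unpaywall.org/v2/{doi}?email={email}"))
}

/// Read the email from UNPAYWALL_EMAIL; an unset variable yields `None`,
/// which the caller can treat as "skip this source".
fn unpaywall_url_from_env(doi: &str) -> Option<String> {
    let email = env::var("UNPAYWALL_EMAIL").ok();
    unpaywall_url(doi, email.as_deref())
}
```

Returning `Option` rather than an error keeps the skip silent: the caller simply moves on to the next source when it gets `None`.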

Conversion Pipeline

Priority chain for PDF-to-markdown:

  1. pdftotext (new) — shell out to pdftotext -layout <pdf> -, check output quality
  2. marker-pdf (existing) — heavy ML OCR fallback
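The pdftotext step can be sketched with `std::process::Command`. The function name is hypothetical, and the program name is passed as a parameter only so the missing-binary path is easy to exercise; in practice it would be `"pdftotext"`:

```rust
use std::path::Path;
use std::process::Command;

/// Run `<program> -layout <pdf> -` and return its stdout as text.
/// `None` covers a binary that cannot be spawned (for example, not on PATH)
/// and a non-zero exit status.
fn extract_text_with(program: &str, pdf: &Path) -> Option<String> {
    let output = Command::new(program)
        .arg("-layout")
        .arg(pdf)
        .arg("-") // "-" as the output file makes pdftotext write to stdout
        .output()
        .ok()?; // spawning fails here when the binary is not found
    if !output.status.success() {
        return None;
    }
    // Lossy conversion: invalid UTF-8 becomes U+FFFD instead of aborting,
    // so a later quality check can judge the text as a whole.
    Some(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A `None` result is the signal to fall back to marker-pdf.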

Quality heuristic for pdftotext output

  • Length > 500 characters
  • At least 80% of characters are printable ASCII or common Unicode (letters, digits, punctuation, whitespace)

If either check fails, or pdftotext isn't on PATH, fall back to marker.
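The two checks can be expressed as one small predicate. This is a sketch under stated assumptions: length is counted in Unicode scalar values rather than bytes, the 80% threshold is treated as inclusive, and "common unicode" is approximated with the standard library's `char` classification, which only recognises ASCII punctuation. The name `looks_like_text` is hypothetical:

```rust
/// Return true when pdftotext output passes both quality checks.
fn looks_like_text(text: &str) -> bool {
    // Check 1: more than 500 characters (assumption: chars, not bytes).
    let total = text.chars().count();
    if total <= 500 {
        return false;
    }
    // Check 2: share of letters, digits, punctuation, and whitespace.
    let good = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || c.is_ascii_punctuation())
        .count();
    // good / total >= 0.8, written in integer arithmetic to avoid floats.
    good * 5 >= total * 4
}
```

Scanned PDFs typically fail the first check because pdftotext finds little or no embedded text, while PDFs with broken font encodings tend to fail the second.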

Nix changes

Add poppler_utils to the flake for pdftotext. Include in both the dev shell and the wrapped binary PATH.

Dependencies

No new Rust crate dependencies — reqwest and serde_json already handle the Unpaywall API call. pdftotext is an external binary like marker_single.