docs: update skill docs for unpaywall and pdftotext support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 13:35:18 -08:00 · 91920d4103
commit 91920d4103
parent c3b63ea2f5
3 changed files with 339 additions and 2 deletions
--- a/docs/plans/2026-02-25-unpaywall-and-pdftotext-design.md
+++ b/docs/plans/2026-02-25-unpaywall-and-pdftotext-design.md
@@ -0,0 +1,35 @@
+# Unpaywall + pdftotext — Design
+
+Two improvements to the paper CLI: try free open-access sources before piracy, and try simple PDF text extraction before heavy ML OCR.
+
+## Download Pipeline
+
+Priority chain for fetching the PDF:
+
+1. **Unpaywall** (new) — hit `https://api.unpaywall.org/v2/{doi}?email={email}`, check `best_oa_location.url_for_pdf`, download if present
+2. **LibGen** (existing)
+3. **Anna's Archive** (existing)
+
+Unpaywall requires an email address but no API key. Read the address from the `UNPAYWALL_EMAIL` env var. If it is unset, skip Unpaywall silently, following the same pattern as `ANNAS_ARCHIVE_KEY`.
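A minimal sketch of the "skip silently when unset" rule, using only the standard library. The function names `unpaywall_url` and `unpaywall_url_from_env` are hypothetical and not taken from the CLI's source; the HTTP request and the `best_oa_location.url_for_pdf` lookup are left out because they depend on `reqwest` and `serde_json`.

```rust
use std::env;

/// Builds the Unpaywall lookup URL for `doi`, or returns `None` when no
/// usable email is available, so the caller can move on to the next source.
/// Hypothetical helper: the name and signature are assumptions.
fn unpaywall_url(doi: &str, email: Option<&str>) -> Option<String> {
    // Treat a missing, empty, or whitespace-only address as "unset".
    let email = email.map(str::trim).filter(|e| !e.is_empty())?;
    // Note: this sketch does not percent-encode the email. A real
    // implementation should, since characters such as `+` have a special
    // meaning in a query string.
    Some(format!("https://api.unpaywall.org/v2/{doi}?email={email}"))
}

/// Reads the address from `UNPAYWALL_EMAIL`, as described in the design.
fn unpaywall_url_from_env(doi: &str) -> Option<String> {
    unpaywall_url(doi, env::var("UNPAYWALL_EMAIL").ok().as_deref())
}
```

Returning `Option` rather than an error keeps the unset case quiet: the download chain can treat `None` as "this source does not apply" and fall through to LibGen.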
+
+## Conversion Pipeline
+
+Priority chain for PDF-to-markdown:
+
+1. **pdftotext** (new) — shell out to `pdftotext -layout <pdf> -`, check output quality
+2. **marker-pdf** (existing) — heavy ML OCR fallback
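The first step of this chain could be sketched as follows, assuming a hypothetical helper named `run_pdftotext`. It shells out with the exact arguments listed above and returns `None` whenever the binary is missing or exits with a failure, which signals the caller to fall back to marker.

```rust
use std::path::Path;
use std::process::Command;

/// Runs `pdftotext -layout <pdf> -` and returns the extracted text.
/// Returns `None` if the binary cannot be started (for example, it is not
/// on PATH) or if it exits with a non-zero status.
/// Hypothetical helper: the name and signature are assumptions.
fn run_pdftotext(pdf: &Path) -> Option<String> {
    let output = Command::new("pdftotext")
        .arg("-layout")
        .arg(pdf)
        .arg("-") // "-" sends the text to stdout instead of a file
        .output()
        .ok()?; // spawn failure, e.g. binary not found

    if !output.status.success() {
        return None;
    }
    // Lossy conversion so that stray invalid bytes do not abort extraction;
    // any resulting replacement characters are left for the caller's
    // quality check to judge.
    Some(String::from_utf8_lossy(&output.stdout).into_owned())
}
```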
+
+### Quality heuristic for pdftotext output
+
+- Length > 500 characters
+- More than 80% of characters are printable ASCII or common Unicode (letters, digits, punctuation, whitespace)
+
+If either check fails, or `pdftotext` isn't on `PATH`, fall back to marker.
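One way to express the two checks, as a sketch rather than the actual implementation. The names `looks_like_text` and `is_readable` are hypothetical, and the definition of "common unicode" is an assumption: here it means any Unicode letter, digit, or whitespace plus ASCII punctuation, so non-ASCII punctuation such as curly quotes counts against the ratio.

```rust
/// Returns true when `text` passes both quality checks: more than 500
/// characters, and more than 80% of them readable.
/// Hypothetical helper: the name and signature are assumptions.
fn looks_like_text(text: &str) -> bool {
    const MIN_CHARS: usize = 500;
    const MIN_READABLE_RATIO: f64 = 0.80;

    // Count Unicode scalar values, not bytes.
    let total = text.chars().count();
    if total <= MIN_CHARS {
        return false;
    }
    let readable = text.chars().filter(|c| is_readable(*c)).count();
    (readable as f64) / (total as f64) > MIN_READABLE_RATIO
}

/// Assumed reading of "letters, digits, punctuation, whitespace".
fn is_readable(c: char) -> bool {
    c.is_alphanumeric() || c.is_ascii_punctuation() || c.is_whitespace()
}
```

Counting `chars()` instead of bytes keeps the length threshold from being inflated by multi-byte characters.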
+
+## Nix changes
+
+Add `poppler_utils` to the flake to provide `pdftotext`. Include it in both the dev shell and the wrapped binary's `PATH`.
+
+## Dependencies
+
+No new Rust crate dependencies — `reqwest` and `serde_json` already handle the Unpaywall API call. `pdftotext` is an external binary like `marker_single`.