Methodology and Limitations

Last updated: April 2, 2026

How extraction works

  • PDFs are processed to recover readable document structure.
  • Scanned or image-based pages are run through OCR where OCR is available.
  • Output is generated in markdown, plain text, and JSON formats.
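
As a rough illustration of the three output formats listed above, the sketch below renders one small recovered structure as markdown, plain text, and JSON. The Block type, its "heading"/"paragraph" kinds, and the function names are hypothetical and chosen only for this example; they do not describe the service's actual internals or output schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Block:
    kind: str  # hypothetical block type: "heading" or "paragraph"
    text: str


def to_markdown(blocks):
    # Headings get a leading "# "; paragraphs are emitted unchanged.
    lines = [f"# {b.text}" if b.kind == "heading" else b.text for b in blocks]
    return "\n\n".join(lines)


def to_plain_text(blocks):
    # Plain text keeps the words but drops the structural markers.
    return "\n\n".join(b.text for b in blocks)


def to_json(blocks):
    # JSON keeps both the text and the block type for each element.
    return json.dumps([asdict(b) for b in blocks])


blocks = [Block("heading", "Invoice"), Block("paragraph", "Total due: 40 USD")]
print(to_markdown(blocks))
print(to_plain_text(blocks))
print(to_json(blocks))
```

The same recovered structure feeds all three renderers, which is why the plain-text form can lose context that the markdown and JSON forms keep.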

Known limitations

  • Low-resolution scans and skewed pages can reduce OCR accuracy.
  • Tables with complex merged cells may require manual cleanup.
  • Highly stylized forms may lose some visual context in text output.

Data handling policy

  • Uploaded input files are deleted after processing completes.
  • Generated outputs are retained for 24 hours, then purged automatically.
  • This page summarizes behavior at the time of the last update above.
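
The 24-hour output retention rule above can be expressed as a simple age check. This is a minimal sketch, not the service's actual purge job: the function name, the use of timezone-aware UTC timestamps, and the choice to treat an output as purgeable at exactly 24 hours are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the policy text; everything else is assumed.
OUTPUT_RETENTION = timedelta(hours=24)


def is_due_for_purge(created_at, now):
    """Return True once an output is at least 24 hours old (assumed boundary)."""
    return now - created_at >= OUTPUT_RETENTION


created = datetime(2026, 4, 2, 12, 0, tzinfo=timezone.utc)
print(is_due_for_purge(created, created + timedelta(hours=23, minutes=59)))
print(is_due_for_purge(created, created + timedelta(hours=24)))
```

Comparing timezone-aware timestamps avoids off-by-hours errors when the creation time and the check time are recorded on machines in different zones.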

Related pages