Methodology and Limitations

Last updated: April 2, 2026

How extraction works

PDFs are processed to recover readable document structure.
Scanned or image-based pages are run through OCR when OCR is available.
Output is generated in markdown, plain text, and JSON formats.
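
The flow above can be pictured with a minimal sketch. It is illustrative only and does not describe the actual implementation: the Page model, extract_page, render, and the ocr callable are hypothetical names, and the sketch assumes a page goes to OCR only when it has no usable embedded text.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    # Hypothetical page model: an embedded text layer (if any) plus raw image bytes.
    text_layer: Optional[str]
    image: bytes = b""


def extract_page(page: Page, ocr: Optional[Callable[[bytes], str]] = None) -> str:
    """Prefer the embedded text layer; fall back to OCR only if an engine is supplied."""
    if page.text_layer and page.text_layer.strip():
        return page.text_layer.strip()
    if ocr is not None:
        return ocr(page.image).strip()
    return ""  # image-based page and no OCR available


def render(pages: list[str]) -> dict[str, str]:
    """Emit the same page texts as markdown, plain text, and JSON."""
    numbered = list(enumerate(pages, 1))
    return {
        "markdown": "\n\n".join(f"## Page {n}\n\n{text}" for n, text in numbered),
        "text": "\n\n".join(pages),
        "json": json.dumps({"pages": [{"number": n, "text": text} for n, text in numbered]}),
    }
```

Passing ocr=None models the case where OCR is not available: image-based pages then yield empty text rather than an error.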

Known limitations

Low-resolution scans and skewed pages can reduce OCR accuracy.
Complex merged-cell tables may require manual cleanup in edge cases.
Highly stylized forms may lose some visual context in text output.

Data handling policy

Uploaded input files are deleted after processing completes.
Generated outputs are retained for 24 hours, then purged automatically.
This page summarizes behavior at the time of the last update above.
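
As a rough illustration of the 24-hour output retention window, the sketch below computes when a generated output becomes eligible for purging. The function names and the use of UTC timestamps are assumptions made for the example, not a description of the service's internals.

```python
from datetime import datetime, timedelta, timezone

# Retention window stated in the policy above.
OUTPUT_RETENTION = timedelta(hours=24)


def purge_deadline(generated_at: datetime) -> datetime:
    """Return the moment a generated output becomes eligible for purging."""
    return generated_at + OUTPUT_RETENTION


def is_expired(generated_at: datetime, now: datetime) -> bool:
    """True once the retention window has fully elapsed."""
    return now >= purge_deadline(generated_at)


# Example: an output generated at noon UTC expires at noon UTC the next day.
generated = datetime(2026, 4, 2, 12, 0, tzinfo=timezone.utc)
deadline = purge_deadline(generated)
```

Timezone-aware timestamps are used so the comparison does not depend on the server's local time.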
