olmocr
Support for formatting (strikethroughs, etc.)
🚀 The feature, motivation and pitch
I'm assuming this isn't supported out of the box? I tried this PDF with allenai/olmOCR-7B-0225-preview
and did not get good results: the output keeps the struck-through words.
{"id": "033dae2f4c12b9b07d00a72702f03ac0639292e4", "text": "The quick fox jumps over the lazy, brave dog, or orangutan.", "source": "olmocr", "added": "2025-02-27", "created": "2025-02-27", "metadata": {"Source-File": "/workspaces/olmocr/tests/gnarly_pdfs/strikethrough_sample.pdf", "olmocr-version": "0.1.58", "pdf-total-pages": 1, "total-input-tokens": 1129, "total-output-tokens": 48, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 59, 1]]}}
From a sample PDF: The ~~quick~~ fox jumps over the ~~lazy~~brave ~~dog~~orangutan.
I modified the prompt and added an instruction to skip struck-through text:
def build_finetuning_prompt(base_text: str) -> str:
    return (
        f"Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. "
        f"Just return the plain text representation of this document as if you were reading it naturally.\n"
        f"The text may or may not contain strike-throughs. Return only the text that is NOT struck through.\n"
        f"Do not hallucinate.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )
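To check whether a prompt change like this actually helps, one option is to derive the expected text from the Markdown version of the sample and compare it with the model output. The sketch below is only an illustration: `strip_struck_text` and `leaked_struck_words` are hypothetical helper names, not part of olmocr, and it assumes strikethroughs are written as `~~...~~` on a single line.

```python
import re

# Matches one Markdown strikethrough span such as ~~quick~~ (non-greedy).
STRIKE_RE = re.compile(r"~~(.+?)~~")


def strip_struck_text(markdown: str) -> str:
    """Drop ~~struck~~ spans and collapse the whitespace they leave behind."""
    kept = STRIKE_RE.sub("", markdown)
    return re.sub(r"\s+", " ", kept).strip()


def leaked_struck_words(markdown: str, ocr_text: str) -> list[str]:
    """Return struck-through words that still appear in the OCR output."""
    ocr_words = set(re.findall(r"\w+", ocr_text.lower()))
    leaked = []
    for span in STRIKE_RE.findall(markdown):
        for word in re.findall(r"\w+", span.lower()):
            if word in ocr_words:
                leaked.append(word)
    return leaked


sample = "The ~~quick~~ fox jumps over the ~~lazy~~brave ~~dog~~orangutan."
ocr_output = "The quick fox jumps over the lazy, brave dog, or orangutan."

expected = strip_struck_text(sample)
leaked = leaked_struck_words(sample, ocr_output)
print(expected)
print(leaked)
```

A non-empty `leaked` list means the model still transcribed struck-through words. This check is word-based, so it would raise a false alarm if a struck word also legitimately appears elsewhere on the page.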
Alternatives
- OpenAI will process this fine if you instruct it to return only the non-struck-through text, but this gets really expensive.
Additional context
No response