Support for formatting (strikethroughs, etc.)

Open knguyen1 opened this issue 5 days ago • 1 comments

🚀 The feature, motivation and pitch

I'm assuming this isn't supported out of the box? I tried this PDF with allenai/olmOCR-7B-0225-preview and did not get good results.

{"id": "033dae2f4c12b9b07d00a72702f03ac0639292e4", "text": "The quick fox jumps over the lazy, brave dog, or orangutan.", "source": "olmocr", "added": "2025-02-27", "created": "2025-02-27", "metadata": {"Source-File": "/workspaces/olmocr/tests/gnarly_pdfs/strikethrough_sample.pdf", "olmocr-version": "0.1.58", "pdf-total-pages": 1, "total-input-tokens": 1129, "total-output-tokens": 48, "total-fallback-pages": 0}, "attributes": {"pdf_page_numbers": [[0, 59, 1]]}}

From a sample PDF: The ~~quick~~ fox jumps over the ~~lazy~~brave ~~dog~~orangutan. 1
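To make "good results" concrete, here is a minimal sketch of what I would expect back for this sample. It assumes the struck-through words are marked with `~~...~~` as in the line above; `strip_struck_through` is a hypothetical helper for illustration, not part of olmocr.

```python
import re

# Matches Markdown-style strikethrough spans such as ~~quick~~ (non-greedy).
STRUCK = re.compile(r"~~.*?~~")


def strip_struck_through(marked_up: str) -> str:
    """Drop ~~struck-through~~ spans, then collapse leftover whitespace."""
    kept = STRUCK.sub("", marked_up)
    return " ".join(kept.split())


# Sample sentence from the PDF, with strikethroughs written as ~~...~~.
sample = "The ~~quick~~ fox jumps over the ~~lazy~~brave ~~dog~~orangutan."
expected = strip_struck_through(sample)

# The "text" field olmocr actually returned in the record above.
observed = "The quick fox jumps over the lazy, brave dog, or orangutan."

print(expected)              # The fox jumps over the brave orangutan.
print(expected == observed)  # False
```

The returned text keeps every struck-through word and adds punctuation, so it matches neither the original nor the corrected sentence.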

I modified the prompt and added an instruction to skip struck-through text:

def build_finetuning_prompt(base_text: str) -> str:
    """Build the page prompt, with an extra instruction to skip struck-through text."""
    return (
        f"Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. "
        f"Just return the plain text representation of this document as if you were reading it naturally.\n"
        # Added instruction: return only the text that is not struck through.
        f"The text may or may not contain strike-throughs. Return only the text that is NOT struck through.\n"
        f"Do not hallucinate.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

Alternatives

  • OpenAI models process this fine if you instruct them to return only the non-struck-through text, but this gets really expensive.

Additional context

No response

knguyen1 • Feb 27 '25 21:02