olmocr OCR Conversion Fails to Detect Headings Properly

Open Lilyzhangyanlin opened this issue 3 days ago • 4 comments

🚀 The feature, motivation and pitch

When converting PDFs to text using OCR, the tool has difficulty identifying and distinguishing headings, such as H1 (main headings) and H2 (subheadings). This leads to a loss of document structure, making the output less readable and harder to process for downstream applications.

Expected Behavior

The tool should correctly detect and differentiate between various heading levels (e.g., H1, H2, etc.). Headings should be formatted distinctly in the output text (e.g., with larger font sizes, bold text, or prefixed markers).

Current Behavior

The OCR does not consistently recognize headings. All text appears in a uniform format, making it difficult to distinguish sections. In some cases, headings are merged with body text, losing their structural significance.

Alternatives

Improve OCR post-processing to detect font size, bold text, or other distinguishing factors to identify headings. Implement a rules-based or ML-based approach to better classify headings. Provide an option to output markdown-like or structured text (e.g., # Heading 1, ## Heading 2).
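
To illustrate the rules-based option, here is a minimal sketch of a post-processing step that emits markdown-style markers. It assumes the OCR stage can expose a font size and a bold flag for each text line, which olmocr may not currently provide. The `Line` class, the `to_markdown` function, and the ratio thresholds are all hypothetical and untuned, not part of any existing API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One OCR text line with layout hints (hypothetical structure)."""
    text: str
    font_size: float
    bold: bool = False


def to_markdown(lines, h1_ratio=1.6, h2_ratio=1.2, max_heading_len=80):
    """Prefix lines that look like headings with '#' or '##'.

    The thresholds are illustrative guesses, not tuned values.
    """
    if not lines:
        return ""
    # Assume most lines are body text, so the median size is the body size.
    body_size = median(line.font_size for line in lines)
    out = []
    for line in lines:
        ratio = line.font_size / body_size
        if ratio >= h1_ratio:
            out.append(f"# {line.text}")
        elif ratio >= h2_ratio or (line.bold and len(line.text) < max_heading_len):
            # Moderately larger text, or a short bold line, is treated as H2.
            out.append(f"## {line.text}")
        else:
            out.append(line.text)
    return "\n".join(out)


if __name__ == "__main__":
    page = [
        Line("Introduction", 24),
        Line("Body text of the first section.", 10),
        Line("Background", 14),
        Line("More body text.", 10),
    ]
    print(to_markdown(page))
```

A heuristic like this would misfire on documents where headings differ from body text only by position or numbering, which is where the ML-based approach might do better.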

Additional context

If needed, I can provide example PDFs where the issue occurs. Let me know if you need more details!

Mar 02 '25 07:03 Lilyzhangyanlin
