olmocr
OCR Conversion Fails to Detect Headings Properly
🚀 The feature, motivation and pitch
When converting PDFs to text using OCR, the tool has difficulty identifying and distinguishing headings, such as H1 (main headings) and H2 (subheadings). This leads to a loss of document structure, making the output less readable and harder to process for downstream applications.
Expected Behavior
The tool should correctly detect and differentiate between various heading levels (e.g., H1, H2, etc.). Headings should be formatted distinctly in the output text (e.g., with larger font sizes, bold text, or prefixed markers).
Current Behavior
The OCR does not consistently recognize headings. All text appears in a uniform format, making it difficult to distinguish sections. In some cases, headings are merged with body text, losing their structural significance.
Alternatives
Improve OCR post-processing to detect font size, bold text, or other distinguishing factors to identify headings.
Implement a rules-based or ML-based approach to better classify headings.
Provide an option to output markdown-like or structured text (e.g., # Heading 1, ## Heading 2).
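To make the rules-based alternative concrete, here is a minimal sketch of what such a post-processing step could look like. It is not based on olmocr's internals: the `Line` record, the `to_markdown` function, and the assumption that each extracted line arrives with a font size and a bold flag are all hypothetical, for illustration only. The heuristic treats the font size covering the most characters as body text, maps larger sizes to heading levels (largest first), and treats short bold lines at body size as one level deeper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted line (not an olmocr type)."""
    text: str
    font_size: float  # assumed to be already rounded to a stable value
    bold: bool = False


def to_markdown(lines, max_levels=3, max_bold_heading_chars=80):
    """Prefix probable headings with '#' markers using font size and bold."""
    if not lines:
        return ""

    # Body size: the font size that covers the most characters.
    weight = Counter()
    for ln in lines:
        weight[ln.font_size] += len(ln.text)
    body_size = weight.most_common(1)[0][0]

    # Distinct sizes above body size become heading levels, largest first.
    heading_sizes = sorted(
        {ln.font_size for ln in lines if ln.font_size > body_size},
        reverse=True,
    )[:max_levels]
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}

    out = []
    for ln in lines:
        level = level_of.get(ln.font_size)
        # Short bold lines at body size: one level below size-based headings.
        if (
            level is None
            and ln.bold
            and ln.font_size == body_size
            and len(ln.text) < max_bold_heading_chars
        ):
            level = min(len(heading_sizes) + 1, 6)
        out.append(f"{'#' * level} {ln.text}" if level else ln.text)
    return "\n".join(out)


sample = [
    Line("Annual Report", 24),
    Line("Overview", 18),
    Line("This report summarizes the year in detail.", 11),
    Line("Notes", 11, bold=True),
    Line("Figures are unaudited and may change.", 11),
]
print(to_markdown(sample))
```

A heuristic like this only helps if font metadata survives extraction; for scanned PDFs with no embedded text layer, an ML-based classifier is probably the more realistic option.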
Additional context
If needed, I can provide example PDFs where the issue occurs. Let me know if you need more details!