Issue with Heading Extraction

Open harinisri2001 opened this issue 1 year ago • 5 comments

Hi, currently all headings, including subheadings and child headings, are marked with ##, so parent and nested headings are indistinguishable from one another.

Anyone else facing this issue?

Dec 06 '24 11:12 harinisri2001

Checked the same document with LlamaParse, and it identifies headers correctly. @dolfim-ibm, any ideas on how we can improve headers?

Dec 09 '24 17:12 simjak

Same here. I tried converting the .pdf to HTML and all headers are identified as h2, plus level is always set to level = 1, so there's no easy way to tell h1, h2, h3... apart.
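A possible stopgap on the Markdown side, in case it helps anyone: a minimal sketch that re-levels headings after export. It assumes the document's headings carry section numbers ("2", "2.1", "2.1.3"), which won't hold for every PDF. `relevel_headings` is a hypothetical helper written for this thread, not part of any library, and it works on the exported Markdown string only.

```python
import re

# Hypothetical post-processing helper (not a library API).
# Assumption: heading text starts with a section number such as "2.1.3".
_HEADING = re.compile(r"^(#+)\s+(.*)$")
_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+")


def relevel_headings(markdown: str, max_level: int = 6) -> str:
    """Rewrite heading markers so depth follows the section number."""
    out = []
    for line in markdown.splitlines():
        heading = _HEADING.match(line)
        if heading:
            title = heading.group(2)
            number = _NUMBER.match(title)
            if number:
                # "2" -> depth 1, "2.1" -> depth 2, "2.1.3" -> depth 3
                depth = number.group(1).count(".") + 1
                line = "#" * min(depth, max_level) + " " + title
        # Headings without a section number are left unchanged.
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "## 1 Intro\n## 1.1 Background\n## 1.1.1 Details\n## Appendix"
    print(relevel_headings(sample))
```

Unnumbered headings stay at ##, and a heading that merely starts with a number (a year, say) would be treated as top level, so check the result by eye.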

Dec 12 '24 12:12 jmvial

dupe of #287?

Feb 04 '25 19:02 jkwatson

Is there anyone following up on this issue?

Feb 14 '25 02:02 Daniel-ltw

It would be great to get a fix for this. Is this in the pipeline to be handled, @dolfim-ibm? Thank you!

Feb 20 '25 14:02 nikhildigde

Closing as duplicate of https://github.com/docling-project/docling/issues/287

May 21 '25 08:05 vagenas