Issue with Heading Extraction

Open harinisri2001 opened this issue 1 year ago • 5 comments

Hi. Currently, all headings, including subheadings and child headings, are marked with ##, so they cannot be told apart. Nothing distinguishes parent headings from nested ones.

Anyone else facing this issue?

harinisri2001 avatar Dec 06 '24 11:12 harinisri2001

I checked the same document with LlamaParse, and it identifies headers correctly. @dolfim-ibm, any ideas on how we can improve header detection?

simjak avatar Dec 09 '24 17:12 simjak

Same here. I tried converting the .pdf to HTML and all headers are identified as h2. On top of that, level is always set to level = 1, so there's no easy way to tell h1, h2, h3... apart.
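
One possible stopgap until this is fixed upstream is to post-process the exported Markdown. This is only a sketch and not a docling feature: it assumes the document's headings carry section numbers such as "1", "1.1" or "2.3.4", and it infers the heading depth from that numbering. The function name `relevel_headings` and the `base_level` parameter are hypothetical, made up for this illustration.

```python
import re

# An ATX heading at any level; group 1 is the title text.
HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$")
# A leading section number such as "1", "2.3" or "4.1.2." followed by text.
NUMBER_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def relevel_headings(markdown: str, base_level: int = 1) -> str:
    """Re-level flat headings using their section numbers (heuristic).

    Headings without a leading section number are left unchanged.
    """
    out = []
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            title = heading.group(1).strip()
            number = NUMBER_RE.match(title)
            if number:
                # "1" -> depth 1, "1.1" -> depth 2, "1.1.1" -> depth 3
                depth = number.group(1).count(".") + 1
                # Markdown only defines heading levels 1 to 6.
                level = min(base_level + depth - 1, 6)
                line = "#" * level + " " + title
        out.append(line)
    return "\n".join(out)


flat = "## 1 Introduction\n## 1.1 Background\n## 1.1.1 Details\n## Abstract"
print(relevel_headings(flat))
```

The heuristic does nothing for documents without numbered sections, and it will misread a title that merely starts with a number (for example a year), so treat the output as a best guess.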

jmvial avatar Dec 12 '24 12:12 jmvial

dupe of #287?

jkwatson avatar Feb 04 '25 19:02 jkwatson

Is there anyone following up on this issue?

Daniel-ltw avatar Feb 14 '25 02:02 Daniel-ltw

It would be great to get a fix for this. Is this in the pipeline to be handled, @dolfim-ibm? Thank you!

nikhildigde avatar Feb 20 '25 14:02 nikhildigde

Closing as duplicate of https://github.com/docling-project/docling/issues/287

vagenas avatar May 21 '25 08:05 vagenas