
How to distinguish between heading hierarchy levels?

Open. Samyssmile opened this issue 5 months ago • 1 comment

I use PP-DocLayout-L for PDF analysis.

Detected labels:

  • image
  • text
  • text
  • text
  • text
  • text
  • text
  • text
  • text
  • text
  • text
  • doc_title
  • paragraph_title // H1
  • text
  • paragraph_title // This should be H2
  • text
  • figure_title
  • paragraph_title
  • text
  • text
  • text

As you can see, there is no distinction between H1, H2, and H3. Is it possible to make this more specific? I need a heading hierarchy.

Samyssmile, Jun 14 '25 15:06

Hello. Recognizing the hierarchical relationships between titles is not currently supported, but this can be handled through post-processing; PP-StructureV3 already offers a similar post-processing solution. We also recommend PP-DocLayout_plus, an upgraded version of PP-DocLayout that recognizes a wider variety of document types and offers higher accuracy.

cuicheng01, Jun 15 '25 08:06
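
For anyone landing here, the post-processing idea mentioned in the reply can be sketched roughly as follows. This is not the PP-StructureV3 implementation, just one simple heuristic: treat the height of each paragraph_title box as a proxy for font size and group similar heights into levels. It assumes the detections have already been converted to plain dicts with a "label" string and a "coordinate" list of [x1, y1, x2, y2]; the function name assign_heading_levels and the tolerance parameter are hypothetical. Titles that wrap onto several lines will have inflated heights, so this only works reasonably for single-line headings.

```python
from typing import Any, Dict, List


def assign_heading_levels(
    boxes: List[Dict[str, Any]],
    title_label: str = "paragraph_title",
    tolerance: float = 0.15,
) -> List[Dict[str, Any]]:
    """Return copies of `boxes`, adding "heading_level" to title boxes.

    Assumption: each box is a dict with "label" and "coordinate"
    ([x1, y1, x2, y2]); box height stands in for font size.
    """

    def height(box: Dict[str, Any]) -> float:
        return box["coordinate"][3] - box["coordinate"][1]

    # Heights of all title boxes, tallest first.
    heights = sorted(
        (height(b) for b in boxes if b["label"] == title_label), reverse=True
    )

    # Build tiers: start a new tier whenever the height drops by more than
    # `tolerance` relative to the reference height of the current tier.
    tiers: List[float] = []
    for h in heights:
        if not tiers or h < tiers[-1] * (1 - tolerance):
            tiers.append(h)

    result = []
    for b in boxes:
        item = dict(b)  # shallow copy so the input is left untouched
        if b["label"] == title_label:
            h = height(b)
            # Level = index of the first (tallest) tier this height fits into.
            item["heading_level"] = next(
                i
                for i, ref in enumerate(tiers, start=1)
                if h >= ref * (1 - tolerance)
            )
        result.append(item)
    return result


if __name__ == "__main__":
    # Made-up sample detections, for illustration only.
    sample = [
        {"label": "paragraph_title", "coordinate": [50, 100, 400, 140]},
        {"label": "text", "coordinate": [50, 150, 500, 300]},
        {"label": "paragraph_title", "coordinate": [50, 320, 300, 350]},
        {"label": "paragraph_title", "coordinate": [50, 500, 250, 522]},
    ]
    for box in assign_heading_levels(sample):
        print(box["label"], box.get("heading_level"))
```

Levels are relative to the titles found in the input, so running this per page can give inconsistent results; collecting the boxes for the whole document first is probably the safer choice. Numbering patterns in the recognized title text (such as "1.", "1.1", "1.1.1") would be a complementary signal if OCR text is available.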