PaddleX
How to distinguish between heading hierarchy levels?
I use PP-DocLayout-L for PDF analysis.
Detected labels:
- image
- text
- text
- text
- text
- text
- text
- text
- text
- text
- text
- doc_title
- paragraph_title // H1
- text
- paragraph_title // This should be H2
- text
- figure_title
- paragraph_title
- text
- text
- text
As you can see, there is no distinction between H1, H2, and H3. Is it possible to make this more specific? I need a heading hierarchy.
Hello. Recognizing the hierarchical relationships between titles is not currently supported, but this can be resolved through post-processing; PP-StructureV3 already offers a similar post-processing solution. We also recommend PP-DocLayout_plus, an upgraded version of PP-DocLayout that recognizes a wider variety of document types and offers higher accuracy.
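For anyone who wants to try the post-processing route by hand, here is a minimal sketch of one possible heuristic. It is not the PP-StructureV3 implementation and not part of the PaddleX API. It assumes each detected block has already been converted to a plain dict with a `label` and a `bbox` of `[x1, y1, x2, y2]` (a hypothetical format, so adapt it to whatever your pipeline actually returns), and that headings are single-line, so the box height is a rough proxy for font size. The function name `assign_heading_levels` and the `tolerance` and `max_level` parameters are made up for this example.

```python
from typing import Any, Dict, List


def assign_heading_levels(
    blocks: List[Dict[str, Any]],
    tolerance: float = 0.15,
    max_level: int = 6,
) -> List[Dict[str, Any]]:
    """Add a "level" key (1 = top level) to every paragraph_title block.

    Assumption: each block is a dict with "label" and "bbox" = [x1, y1, x2, y2].
    This is a hypothetical input format, not the raw PaddleX result object.
    """
    # Copy the blocks so the caller's data is not modified; order is preserved.
    result = [dict(b) for b in blocks]
    titles = [b for b in result if b["label"] == "paragraph_title"]
    if not titles:
        return result

    def height(block: Dict[str, Any]) -> float:
        # Box height stands in for font size (only valid for single-line titles).
        _, y1, _, y2 = block["bbox"]
        return y2 - y1

    # Walk the titles from tallest to shortest and open a new level whenever
    # the height drops by more than `tolerance` relative to the current level.
    ranked = sorted(titles, key=height, reverse=True)
    level = 1
    anchor = height(ranked[0])
    for block in ranked:
        h = height(block)
        if h < anchor * (1 - tolerance):
            level = min(level + 1, max_level)
            anchor = h
        block["level"] = level
    return result
```

Calling `assign_heading_levels(blocks)` returns the blocks in their original order, with `level` set only on `paragraph_title` entries; `doc_title`, `text`, and the other labels are left untouched. Height alone is a weak signal, so on real documents you would probably want to combine it with numbering patterns from the OCR text (such as "1.", "1.1", "1.1.1") or with left indentation.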