
Export to markdown only contains H2 headers

Open nikhildigde opened this issue 9 months ago • 4 comments

Bug

I tried loading a pdf file with multiple headings / sections. But seems like docling always extracts it to markdown with H2 (##) only. Am I doing something wrong here? I have tried with multiple PDFs.

docling_test.pdf

...

Steps to reproduce

import logging
import time
from pathlib import Path

from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import FigureElement, InputFormat, Table
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0

def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("/Users/nikhildi/Downloads/solution.pdf")

    output_dir = Path("scratch")
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng"])
    pipeline_options.generate_picture_images = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    conv_res = doc_converter.convert(input_doc_path)
    md_filename = output_dir / "test.md"

    print(conv_res.document.save_as_markdown(filename=md_filename, image_placeholder=""))

...

Docling version

docling 2.15.1 docling-core 2.15.1 docling-ibm-models 3.2.1 docling-parse 3.1.1 ...

Python version

Python 3.11.11 ...

nikhildigde avatar Feb 19 '25 16:02 nikhildigde

@nikhildigde Yes, this is known for now. Basically, we need to infer the table-of-contents in order to get the right level of the headers. For the moment, the header level can be inferred for docx, html and md, but not yet for pdf. After we refactor the reading-order model, this is the next issue we want to handle.

PeterStaar-IBM avatar Feb 21 '25 06:02 PeterStaar-IBM

@PeterStaar-IBM thank you for the response and explanation. Sorry to ask, but do you have any ETA for this fix? Also, if there is no table of contents will this not work? I thought it would need some model training to get this right?

nikhildigde avatar Feb 21 '25 06:02 nikhildigde

@nikhildigde As soon as we can, we will probably start with an approximate solution and then gradually improve it.

PeterStaar-IBM avatar Feb 21 '25 07:02 PeterStaar-IBM

Ok. Thank you for the great work. Appreciate it!

nikhildigde avatar Feb 21 '25 07:02 nikhildigde

Just listing some of the previous tickets that mention the same issue. #529 #652

Daniel-ltw avatar Feb 28 '25 02:02 Daniel-ltw

is there any update regarding this topic?

myteberib avatar Mar 19 '25 13:03 myteberib

Just to add, another way to approach this could be using font size. Headings would generally have a higher font size than the body text.

I ran a test doc and it seems Docling currently only extracts space_width for characters, not font size (space_width could be a close proxy as well).
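To make the font-size idea concrete, here is a minimal sketch. It assumes you already have a (text, font size) pair for each detected heading from some other source; as noted above, Docling does not appear to expose font size, so the input format and the function name `levels_from_font_sizes` are hypothetical.

```python
def levels_from_font_sizes(headings, tolerance=0.5):
    """Map larger font sizes to shallower heading levels (1 = largest).

    `headings` is a list of (text, font_size) pairs. This input shape is an
    assumption for illustration, not something Docling provides.
    """
    # Group sizes into tiers; a size within `tolerance` points of the current
    # tier joins it instead of opening a new one.
    tiers = []
    for size in sorted({size for _, size in headings}, reverse=True):
        if not tiers or tiers[-1] - size > tolerance:
            tiers.append(size)

    def level_of(size):
        # The first tier close enough to this size decides the level.
        for index, tier in enumerate(tiers):
            if tier - size <= tolerance:
                return min(index + 1, 6)  # Markdown stops at H6
        return min(len(tiers), 6)

    return [(text, level_of(size)) for text, size in headings]


sample = [("Title", 24.0), ("Section", 18.0), ("Sub", 14.0), ("Section 2", 18.2)]
print(levels_from_font_sizes(sample))
```

This only ranks heading sizes relative to each other, so it would likely need a check against the body-text size to avoid promoting emphasized paragraphs.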

SAURABH-CARDANO avatar Apr 02 '25 15:04 SAURABH-CARDANO

While waiting for this improvement, is there a simple way to (re)define the headings if the table of contents is determined elsewhere, for example using internal links?
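One possible workaround, sketched here as plain text postprocessing on the exported markdown rather than anything built into Docling: if you can get a title-to-level mapping from elsewhere, you can rewrite the heading markers yourself. The function name `apply_toc_levels` and the `{title: level}` mapping format are assumptions made for this example.

```python
import re

# A Markdown ATX heading: one or more '#', whitespace, then the title.
HEADING_RE = re.compile(r"^#+\s+(.*\S)\s*$")


def _normalize(title):
    # Collapse whitespace and ignore case so near-identical titles match.
    return " ".join(title.split()).casefold()


def apply_toc_levels(markdown, toc):
    """Rewrite heading levels using a {title: level} mapping.

    Headings that are not in the mapping are left unchanged.
    """
    levels = {_normalize(title): level for title, level in toc.items()}
    out = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            title = match.group(1)
            level = levels.get(_normalize(title))
            if level is not None:
                line = "#" * max(1, min(level, 6)) + " " + title
        out.append(line)
    return "\n".join(out)


exported = "## Introduction\ntext\n## Background\n## Prior work\n## Unknown"
toc = {"Introduction": 1, "Background": 2, "Prior work": 3}
print(apply_toc_levels(exported, toc))
```

Matching by title is fragile when the same title appears more than once, and lines inside fenced code blocks that start with '#' would also be rewritten, so treat this as a starting point.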

blenzi avatar Apr 07 '25 19:04 blenzi

Been looking at the latest commits from this project and it seems like docx has been the focus?

@PeterStaar-IBM Any progress on this?

Daniel-ltw avatar Apr 10 '25 03:04 Daniel-ltw

@Daniel-ltw We have been working on the docling-parse to propagate the TOC info if it is present in the pdf. This is the first step towards a general solution.

PeterStaar-IBM avatar Apr 10 '25 04:04 PeterStaar-IBM

This is a quite important feature if, for instance, you want to develop smarter chunking.

acarv avatar Apr 10 '25 16:04 acarv

@PeterStaar-IBM Is there a first step solution for pdf without a TOC?

Daniel-ltw avatar Apr 10 '25 20:04 Daniel-ltw

In our team (just an end-user, not related to docling), we've been using some heuristic-based postprocessing where we look for patterns like:

  1. Header
  1.1. subheader
  1.1.1. subsubheader

Works pretty well if your document is formatted like that. We also first looked for "Chapter x" type headers and "indent" everything that isn't a chapter header already one heading level down

Edit: clarified that this solution is not related to any docling developments
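For readers who want to try this kind of postprocessing, here is a minimal sketch of the numbering part only. It is not the code from the comment above; the function name `relevel_numbered_headings` and the regular expressions are assumptions, and it runs on the exported markdown text rather than on any Docling API.

```python
import re

# A Markdown ATX heading: one or more '#', whitespace, then the title.
HEADING_RE = re.compile(r"^#+\s+(?P<title>.*)$")
# A dotted section number at the start of the title, e.g. "1.", "1.1." or "2.3.4".
NUMBER_RE = re.compile(r"^(?P<num>\d+(?:\.\d+)*)\.?\s+\S")


def relevel_numbered_headings(markdown, base_level=1):
    """Set each heading's depth from its dotted section number.

    Headings without a leading number are left unchanged.
    """
    out = []
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            title = heading.group("title")
            number = NUMBER_RE.match(title)
            if number:
                # "1" -> depth 1, "1.1" -> depth 2, "1.1.1" -> depth 3
                depth = number.group("num").count(".") + 1
                level = min(base_level + depth - 1, 6)  # Markdown stops at H6
                line = "#" * level + " " + title
        out.append(line)
    return "\n".join(out)


exported = "## 1. Intro\n## 1.1. Scope\n## 1.1.1. Details\n## Appendix\nbody"
print(relevel_numbered_headings(exported))
```

A title that merely starts with a number (a year, for example) would be treated as a top-level section, so real documents will likely need extra checks.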

Vinno97 avatar Apr 12 '25 21:04 Vinno97

@Vinno97 Is this in the current version or is this all coming in the next version? Or is this post processing done after you get docling to convert it into markdown?

Daniel-ltw avatar Apr 13 '25 21:04 Daniel-ltw

Sorry I should've phrased my comment better. This is some postprocessing my team has been doing as an end user of docling for our own documents. I'm not in any way related to the Docling team.

Edited my comment for future reference as well

Vinno97 avatar Apr 14 '25 10:04 Vinno97

@PeterStaar-IBM - Do you have any updates on when the basic version would be available?

nikhildigde avatar May 06 '25 07:05 nikhildigde

Waiting eagerly for this update as well! It's a super critical feature for processing any official document.

manishdash12 avatar May 07 '25 17:05 manishdash12

I just finished the evaluation code (so we can finally measure performance). We will now start working on the identification.

PeterStaar-IBM avatar May 08 '25 05:05 PeterStaar-IBM

Hi @PeterStaar-IBM , is there an estimated time when it would be ready & pushed to the repo? Really appreciate your efforts!

MernaSenger avatar Jun 04 '25 13:06 MernaSenger

> I ran a test doc and it seems, Docling currently only extracts space_width for characters not font size (which can be a close proxy as well)

I can't find space_width anywhere, how did you obtain it?

copernico avatar Jun 18 '25 15:06 copernico

I am facing the same problem with the vlm_pipeline. I am selecting different models using vlm_model_spec, but all the models mark headings as H2 and even miss table sections on certain pages. I am using the example code from the docling documentation:

from docling.pipeline.vlm_pipeline import VlmPipelineOptions
from docling.datamodel import vlm_model_specs

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITE_VISION_OLLAMA
)

HRV1527 avatar Jul 11 '25 14:07 HRV1527

Hi @PeterStaar-IBM - Apologies for nagging, but any news on this feature release?

nikhildigde avatar Jul 19 '25 09:07 nikhildigde

> In our team, just an end-user not docling related, we've been using some heuristic based postprocessing where we look for patterns like.
>
>   1. Header
>   1.1. subheader
>   1.1.1. subsubheader
>
> Works pretty well if your document is formatted like that. We also first looked for "Chapter x" type headers and "indent" everything that isn't a chapter header already one heading level down
>
> Edit: clarified that this solution is not related to any docling developments

@Vinno97, this will work, but it also has to handle:

  1. Documents that mix numbered and non-numbered headers.
  2. Header numbers that start from any number, not always 1.

Many more cases need to be handled. If there were a way like in HTML (where headers and subheaders are easily identified based on font), that would be great.

swaroopgv avatar Aug 01 '25 19:08 swaroopgv

@PeterStaar-IBM is there any way to contribute to this feature? I understand it's complicated and I'm willing to help with any bits where possible.

nikhildigde avatar Aug 26 '25 04:08 nikhildigde

Hi all, (specifically @nikhildigde @HRV1527 @MernaSenger @Daniel-ltw),

I have been working on this topic. I'm hoping to open a docling PR for it, but due to the complexity of integrating the solution I decided to make a small package, docling-hierarchical-pdf, that is tailored to work with Docling and adds inference of PDF hierarchies. It works with scanned PDFs as well as text-based PDFs and has only a few extra dependencies in addition to docling.

It is still in an early stage, but please do give it a try and give feedback. I ran a bunch of tests and was happy with the performance. The limitation, if any, seems to be more on the docling document parsing side. The package attempts to read the TOC from PDF metadata as well as inferring document hierarchy based on header numbering and header styles and font size.

Please try it out and report bugs related to header hierarchy parsing here. I promise I will create a PR for docling using the insights and code in my package.

krrome avatar Oct 08 '25 17:10 krrome

@krrome that's gr8 news. Thanks for the efforts. Will try it out asap and give feedback. God speed

nikhildigde avatar Oct 08 '25 17:10 nikhildigde