docling
Export to markdown only contains H2 headers
Bug
I tried loading a PDF file with multiple headings and sections, but docling seems to always export them to markdown as H2 (##) only. Am I doing something wrong here? I have tried with multiple PDFs.
...
Steps to reproduce
import logging
import time
from pathlib import Path

from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import FigureElement, InputFormat, Table
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
    EasyOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0


def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("/Users/nikhildi/Downloads/solution.pdf")
    output_dir = Path("scratch")

    # OCR with Tesseract (English); picture images are not generated.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng"])
    pipeline_options.generate_picture_images = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()
    conv_res = doc_converter.convert(input_doc_path)

    # Export to markdown; every heading in the result comes out as "##".
    md_filename = output_dir / "test.md"
    conv_res.document.save_as_markdown(filename=md_filename, image_placeholder="")
...
Docling version
docling 2.15.1
docling-core 2.15.1
docling-ibm-models 3.2.1
docling-parse 3.1.1
...
Python version
Python 3.11.11 ...
@nikhildigde Yes, this is known for now. Basically, we need to infer the table-of-contents in order to get the right level of the headers. For the moment, the header level can be inferred for docx, html and md, but not yet for pdf. After we refactor the reading-order model, this is the next issue we want to handle.
@PeterStaar-IBM thank you for the response and explanation. Sorry to ask, but do you have any ETA for this fix? Also, if there is no table of contents will this not work? I thought it would need some model training to get this right?
@nikhildigde As soon as we can, we will probably start with an approximate solution and then gradually improve it.
Ok. Thank you for the great work. Appreciate it!
Just listing some of the previous tickets that mention the same issue. #529 #652
Is there any update regarding this topic?
Just to add, another way to approach this could be using font size. Headings would generally have a higher font size than the body text.
I ran a test doc and it seems Docling currently only extracts space_width for characters, not font size (space_width could serve as a close proxy as well).
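The font-size idea above could be sketched roughly as follows. This is a standalone illustration, not Docling code: it assumes a (text, font_size) pair is already available for each heading candidate from some other tool, and `levels_from_font_sizes` is a hypothetical helper name.

```python
from typing import List, Tuple


def levels_from_font_sizes(
    headings: List[Tuple[str, float]], max_level: int = 6
) -> List[Tuple[str, int]]:
    """Map heading candidates to levels: the largest font size becomes level 1."""
    # Distinct sizes, largest first; rounding absorbs small extraction noise.
    sizes = sorted({round(size, 1) for _, size in headings}, reverse=True)
    rank = {size: min(i + 1, max_level) for i, size in enumerate(sizes)}
    return [(text, rank[round(size, 1)]) for text, size in headings]


def to_markdown(headings: List[Tuple[str, int]]) -> List[str]:
    """Render (text, level) pairs as markdown heading lines."""
    return ["#" * level + " " + text for text, level in headings]


# Assumed input: heading candidates with font sizes in points.
sample = [
    ("Introduction", 18.0),
    ("Background", 14.0),
    ("Prior work", 12.0),
    ("Method", 18.0),
]
font_md = to_markdown(levels_from_font_sizes(sample))
```

A ranking like this only holds when heading sizes are consistent across the document; bold same-size headings or decorative title fonts would need extra signals.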
While waiting for this improvement, is there a simple way to (re)define the headings if the table of contents is determined elsewhere, for example using internal links?
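One possible stopgap is to re-level the headings in the exported markdown against a TOC obtained elsewhere. The sketch below works on the markdown text only and does not use any Docling API; the `toc` mapping (title to level) and the helper name `relevel_with_toc` are assumptions for illustration.

```python
import re
from typing import Dict

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def relevel_with_toc(markdown: str, toc: Dict[str, int]) -> str:
    """Rewrite heading markers using levels from an externally supplied TOC."""
    normalized = {title.strip().lower(): level for title, level in toc.items()}
    out = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            title = match.group(2)
            # Headings missing from the TOC keep their original level.
            level = normalized.get(title.strip().lower(), len(match.group(1)))
            line = "#" * level + " " + title
        out.append(line)
    return "\n".join(out)


# Hypothetical TOC, e.g. derived from internal links or PDF bookmarks.
toc_example = {"Overview": 1, "Installation": 2, "Linux": 3}
md_example = "## Overview\ntext\n## Installation\n## Linux\n## Appendix"
releveled = relevel_with_toc(md_example, toc_example)
```

Matching is by exact (case-insensitive) title, so OCR noise or repeated titles would need fuzzier matching, and lines starting with `#` inside code blocks would be treated as headings.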
I've been looking at the latest commits to this project, and it seems like docx has been the focus?
@PeterStaar-IBM Any progress on this?
@Daniel-ltw We have been working on the docling-parse to propagate the TOC info if it is present in the pdf. This is the first step towards a general solution.
This is a quite important feature if, for instance, you want to develop smarter chunking.
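To illustrate why heading levels matter for chunking, here is a small sketch that splits markdown into chunks tagged with their heading path. It is an assumption-laden example (the helper `chunk_by_headings` is hypothetical and not part of Docling): when every heading is `##`, each chunk only knows its immediate title, whereas correct levels give it the full path.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def chunk_by_headings(markdown: str) -> List[Tuple[List[str], str]]:
    """Return (heading_path, body_text) pairs for each section with body text."""
    chunks: List[Tuple[List[str], str]] = []
    path: List[Tuple[int, str]] = []  # stack of (level, title)
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Pop siblings and deeper headings before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, m.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Same content, once with flat H2 headings and once with real levels.
flat = chunk_by_headings("## Setup\n## Linux\napt install foo")
nested = chunk_by_headings("# Setup\n## Linux\napt install foo")
```

With flat headings the chunk loses the "Setup" context; with proper levels the path is `["Setup", "Linux"]`.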
@PeterStaar-IBM Is there a first step solution for pdf without a TOC?
In our team (just end users, not related to docling), we've been using some heuristic-based postprocessing where we look for patterns like:
1. Header
1.1. subheader
1.1.1. subsubheader
This works pretty well if your document is formatted like that. We also first look for "Chapter x" type headers and "indent" everything that isn't already a chapter header one heading level down.
Edit: clarified that this solution is not related to any docling developments
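A heuristic along these lines might look like the sketch below. It is not the commenter's actual code (which is not shown in the thread); it assumes headings arrive as flat `## 1.1. Title` lines in the exported markdown, and `relevel_by_numbering` is a hypothetical name.

```python
import re

# Heading whose title starts with a dotted section number, e.g. "## 1.2.3. Title".
NUMBERED = re.compile(r"^#{1,6}\s+((\d+(?:\.\d+)*)\.?\s+.*)$")
# Heading of the form "## Chapter 3 ...".
CHAPTER = re.compile(r"^#{1,6}\s+(Chapter\s+\d+.*)$", re.IGNORECASE)


def relevel_by_numbering(markdown: str) -> str:
    """Derive heading levels from numbering depth in already exported markdown."""
    lines = markdown.splitlines()
    # If the document has chapter headings, push everything else one level down.
    offset = 1 if any(CHAPTER.match(line) for line in lines) else 0
    out = []
    for line in lines:
        chapter = CHAPTER.match(line)
        numbered = NUMBERED.match(line)
        if chapter:
            line = "# " + chapter.group(1)
        elif numbered:
            depth = numbered.group(2).count(".") + 1  # "1.1.1" -> 3
            line = "#" * min(depth + offset, 6) + " " + numbered.group(1)
        out.append(line)
    return "\n".join(out)


numbered_md = relevel_by_numbering(
    "## 1. Header\n## 1.1. subheader\n## 1.1.1. subsubheader\nbody"
)
chapter_md = relevel_by_numbering("## Chapter 2\n## 1. Scope")
```

Headings that merely begin with a number (for example a year) would be misread as section numbers, so a real implementation would likely need additional checks.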
@Vinno97 Is this in the current version or is this all coming in the next version? Or is this post processing done after you get docling to convert it into markdown?
Sorry I should've phrased my comment better. This is some postprocessing my team has been doing as an end user of docling for our own documents. I'm not in any way related to the Docling team.
Edited my comment for future reference as well
@PeterStaar-IBM - Do you have any updates on when the basic version would be available?
Waiting eagerly for this update as well! It's a super critical feature for processing any official document.
I just finished the evaluation code (so we can finally measure performance). We will start now working on the identification.
Hi @PeterStaar-IBM , is there an estimated time when it would be ready & pushed to the repo? Really appreciate your efforts!
> I ran a test doc and it seems, Docling currently only extracts space_width for characters not font size (which can be a close proxy as well)
I can't find space_width anywhere, how did you obtain it?
I am facing the same problem with the vlm_pipeline. I am selecting different models using vlm_model_spec, but all the models mark headings as H2 and even miss table sections on certain pages. I am using the example code from the docling documentation:
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITE_VISION_OLLAMA,
)
Hi @PeterStaar-IBM - Apologies for nagging, but any news on this feature release?
> In our team, just an end-user not docling related, we've been using some heuristic based postprocessing where we look for patterns like.
> 1. Header
> 1.1. subheader
> 1.1.1. subsubheader
> Works pretty well if your document is formatted like that. We also first looked for "Chapter x" type headers and "indent" everything that isn't a chapter header already one heading level down
> Edit: clarified that this solution is not related to any docling developments
@Vinno97, this will work, but it also has to handle:
- documents that mix numbered and unnumbered headers.
- header numbers that start from any number, not always 1.
Many more cases need to be handled. If there were a way like in HTML (where headers and subheaders are easily identified based on font), that would be great.
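The two cases listed above could be approached as in the following sketch. It makes two loudly stated assumptions that are not from the thread: the shallowest numbering depth seen is treated as level 1 (so the starting number does not matter), and an unnumbered header simply reuses the level of the header before it. `infer_levels` is a hypothetical helper.

```python
import re
from typing import List, Optional

SECTION_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")


def infer_levels(titles: List[str]) -> List[int]:
    """Levels from numbering depth; unnumbered titles reuse the previous level."""
    depths: List[Optional[int]] = []
    for title in titles:
        m = SECTION_NUMBER.match(title)
        depths.append(m.group(1).count(".") + 1 if m else None)
    numbered = [d for d in depths if d is not None]
    # Shift so the shallowest numbering seen maps to level 1, whatever it starts at.
    base = min(numbered) if numbered else 1
    levels: List[int] = []
    previous = 1
    for depth in depths:
        # Assumption: an unnumbered header is a sibling of the header before it.
        level = previous if depth is None else depth - base + 1
        levels.append(level)
        previous = level
    return levels


mixed_levels = infer_levels(
    ["3. Results", "3.1 Setup", "Notes", "3.2 Findings", "4. Discussion"]
)
```

Whether an unnumbered header should be a sibling or a child of the preceding one is document-specific, which is part of why font or layout information would be a more reliable signal.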
@PeterStaar-IBM is there any way to contribute to this feature? I understand it's complicated and I'm willing to help with any bits where possible.
Hi all, (specifically @nikhildigde @HRV1527 @MernaSenger @Daniel-ltw),
I have been working on this topic. I'm hoping to open a docling PR for it, but due to the complexity of integrating the solution I decided to make a little package, docling-hierarchical-pdf, that is tailored exactly to work with Docling and adds inference of PDF hierarchies. It works with scanned PDFs as well as text-based PDFs and has only very few extra dependencies beyond docling.
It is still in an early stage, but please do give it a try and give feedback. I ran a bunch of tests and was happy with the performance. The limitation, if any, seems to be more on the docling document parsing side. The package attempts to read the TOC from PDF metadata and also infers the document hierarchy from header numbering, header styles and font size.
Please try it out and report bugs related to header hierarchy parsing here. I promise I will create a PR for docling using the insights and code in my package.
@krrome that's gr8 news. Thanks for the efforts. Will try it out asap and give feedback. God speed