
Some content being left behind during file conversion, with both standard and vlm pipelines.

Open bancroftway opened this issue 5 months ago • 7 comments

Thank you so much for this wonderful library!!

I wanted to report a slight issue where certain information in a PDF is not being converted, in both the standard and VLM pipelines. I am attaching the original PDF and the Markdown output. I have highlighted in red the text that is not included in the output.

Image

NYSample.pdf NYSample.pdf.md

bancroftway avatar Jul 09 '25 14:07 bancroftway

Thanks for the detailed report and attachments! This is a known challenge with both the standard and VLM pipelines, especially for tables and complex layouts. In the standard pipeline, missing content often results from segmentation or model limitations—tables or certain text blocks might not be detected if the layout or table structure models don't recognize them, or if fallback OCR doesn't trigger. In the VLM pipeline, the output depends entirely on what the vision-language model returns; if it skips content, Docling can't recover it downstream. Features like table-mode or enrichment options are only available in the standard pipeline, not VLM, and remote models like Ollama may not fully support table extraction or markdown formatting as expected [source].

If the missing content is in images or small graphical elements, you can try enabling OCR and setting min_picture_page_surface_ratio=0 in your pipeline options, and ensure an OCR engine (like EasyOCR or Tesseract) is installed [source].

For best results with tables in the VLM pipeline, use a VLM server that implements the OpenAI API format (like vllm or llama-server), as these are more compatible with Docling's remote model support [source].

If you have a specific example of missing content (e.g., a table or a block of text), sharing the details can help pinpoint whether it's a model limitation or a configuration tweak that might help. Complex layouts and unusual PDF encoding remain challenging, but the team is actively improving coverage and error reporting.

To reply, just mention @dosu.

dosubot[bot] avatar Jul 09 '25 14:07 dosubot[bot]

@bancroftway I just checked the output with our default pipeline. Here is what I see:

  • "County of Steuben Town of Woodhull" is identified as a page header. It ends up in the converted document but is suppressed in Markdown / HTML by default, because page headers are usually not wanted in the plain output. You can export the document including page headers by enabling the FURNITURE content layer. @dosu please illustrate how that is done with example code.
  • "PARCEL INFORMATION 468200 522.00-06-269.100 Tower Site" is present in the regular output, but at the bottom. The reading-order erroneously put this as the last element.
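To tell the two cases above apart in an exported file (text that is missing versus text that is present but misplaced), a small check like the one below may help. It is a minimal sketch that uses only the Python standard library and is not part of Docling. The helper name `locate_phrases` and the sample Markdown are made up for illustration; only the two quoted phrases come from this thread.

```python
from typing import Dict, List, Optional


def locate_phrases(markdown: str, phrases: List[str]) -> Dict[str, Optional[float]]:
    """Map each phrase to its relative position in the text
    (0.0 = start, 1.0 = end), or None if it does not occur."""
    # Collapse whitespace and ignore case so line wraps in the export
    # do not hide a match.
    normalized = " ".join(markdown.split()).lower()
    report: Dict[str, Optional[float]] = {}
    for phrase in phrases:
        needle = " ".join(phrase.split()).lower()
        index = normalized.find(needle)
        if index < 0:
            report[phrase] = None
        else:
            report[phrase] = round(index / max(len(normalized), 1), 2)
    return report


# Made-up export text, standing in for the contents of NYSample.pdf.md.
sample_markdown = "## Tower report\n\nOwner: Example Corp\n\nPARCEL INFORMATION 468200"
report = locate_phrases(
    sample_markdown,
    ["County of Steuben Town of Woodhull", "PARCEL INFORMATION"],
)
for phrase, position in report.items():
    print(phrase, "->", "missing" if position is None else position)
```

A phrase reported as missing may sit in a suppressed layer such as a page header. A phrase with a position close to 1.0, when it appears near the top of the PDF page, points to a reading-order problem instead.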

cau-git avatar Jul 10 '25 12:07 cau-git

To include page headers (like "County of Steuben Town of Woodhull") in your Markdown or HTML export, use the Python API and specify both BODY and FURNITURE content layers. Here’s a minimal example:

# Imports the snippet relies on. ContentLayer comes from docling-core;
# the exact module path may differ between versions.
from docling.document_converter import DocumentConverter
from docling_core.types.doc.document import ContentLayer

# Convert your document as usual, or load an already saved JSON from disk.
source = "/path/to/your/file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document

# Export to Markdown including page headers (FURNITURE)
md = doc.export_to_markdown(
    included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
)
with open("NYSample_with_headers.md", "w", encoding="utf-8") as f:
    f.write(md)

# Or export to HTML with headers
html = doc.export_to_html(
    included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
)
with open("NYSample_with_headers.html", "w", encoding="utf-8") as f:
    f.write(html)

This option is available via the Python API, not the CLI. By default, only BODY is included, so you must explicitly add FURNITURE to see headers and footers in your output. For more details, see the relevant code and parameters in the Docling source.


dosubot[bot] avatar Jul 10 '25 13:07 dosubot[bot]

@dosu what will be the steps to accomplish the above when using docling-serve?

bancroftway avatar Jul 10 '25 14:07 bancroftway

Currently, docling-serve does not support including custom content layers like FURNITURE (for page headers/footers) in exports via its API endpoints. This option is only available in the Python API, where you can specify included_content_layers. The docling-serve API and its request models do not expose a way to control which content layers are included—exports always use the default (BODY only) and there’s no parameter to change this in the API or request payloads [source].

If you need to include page headers in your output, you’ll need to use the Python API for now. Supporting this in docling-serve would require a code change to pass an included_content_layers option from the API through to the export logic [source].
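As a rough illustration of what "passing an option from the API through to the export logic" could look like, here is a toy model in plain Python. It is not docling-serve code. The `Layer` enum, the `Item` class, the `parse_layers` and `export_text` helpers, and the `included_content_layers` request field are all hypothetical stand-ins, and the string values of the enum are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set


class Layer(Enum):
    # Toy stand-in for a content-layer enum; only the two layers
    # discussed in this thread, with assumed string values.
    BODY = "body"
    FURNITURE = "furniture"


@dataclass
class Item:
    text: str
    layer: Layer


def parse_layers(payload: Dict[str, object]) -> Set[Layer]:
    """Read a hypothetical 'included_content_layers' request field.
    Falls back to BODY only when the field is absent or empty."""
    names = payload.get("included_content_layers")
    if not names:
        return {Layer.BODY}
    return {Layer(name) for name in names}  # raises ValueError on unknown names


def export_text(items: List[Item], included: Set[Layer]) -> str:
    """Join the items whose layer is included, preserving document order."""
    return "\n".join(item.text for item in items if item.layer in included)


items = [
    Item("County of Steuben Town of Woodhull", Layer.FURNITURE),
    Item("PARCEL INFORMATION", Layer.BODY),
]
# A request without the field keeps today's behaviour: BODY only.
default_output = export_text(items, parse_layers({}))
# A request that names both layers also gets the page header.
full_output = export_text(
    items, parse_layers({"included_content_layers": ["body", "furniture"]})
)
```

The point of the sketch is the default: a request that omits the field behaves exactly as before, so adding such an option would not change output for existing clients.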


dosubot[bot] avatar Jul 10 '25 14:07 dosubot[bot]

"PARCEL INFORMATION 468200 522.00-06-269.100 Tower Site" is present in the regular output, but at the bottom. The reading-order erroneously put this as the last element.

I also see this behaviour, where information at the top of the page is placed at the bottom of the output when exported to markdown/text. See attached invoice pdf.

In my case, the markdown from docling is sent to another model that returns a structured output that needs to contain a supplier name. The model fails to find the correct supplier name because it gets appended to the delivery address by docling.

uzerp-invoice-obf.pdf

Observed when using docling package version 2.55.1.

steveblamey avatar Oct 15 '25 13:10 steveblamey

Testing today with docling 2.60.1 and this issue is resolved for me.

steveblamey avatar Nov 17 '25 13:11 steveblamey