Image formats not generating picture descriptions, only OCR text extraction

Open JViktoRArtola opened this issue 2 months ago • 7 comments

When converting images using Docling, the library does not generate picture descriptions for image formats. It only performs OCR text extraction when text is present in the image. The pictures=[] array remains empty for all images, regardless of whether they contain text or not, making it impossible to retrieve any visual content descriptions.

However, if the same image is embedded in a PDF file, Docling correctly generates picture descriptions and populates the pictures array. This inconsistency suggests that the image processing pipeline behaves differently for standalone image formats (PNG, JPG) versus images within PDF documents.

This image features a close-up of an adorable ginger tabby kitten with bright, curious blue eyes. The kitten has a soft, orange-striped coat and is gazing up with an expression of innocence and wonder. The background is softly blurred, drawing attention to the kitten’s sweet and delicate features.

Steps to reproduce

  1. Set up Docling with the following configuration:

    # Entry in the format_options mapping passed to DocumentConverter:
    InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    # or InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),

    def _build_full_pipeline_options(self) -> PdfPipelineOptions:
        """Build the full/accurate PdfPipelineOptions configuration.

        Includes OCR, table structure, formula/code enrichment and picture description.
        """
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.do_formula_enrichment = True
        pipeline_options.do_code_enrichment = True
        pipeline_options.generate_picture_images = True
        pipeline_options.enable_remote_services = True
        pipeline_options.do_picture_description = True
        pipeline_options.picture_description_options = self._picture_description_options  # OpenAI
        return pipeline_options
    
  2. Convert two versions of the same image:

    • Image with text (pussInBoots.png): Contains the text "am Puss in Boots"
    • Image without text (pussInBoots_no_text.png): Same image with text removed

    result = converter.convert(image_file)
    markdown_content = result.document.export_to_markdown(image_mode=ImageRefMode.REFERENCED)
    
  3. Inspect the resulting DoclingDocument objects:

    print(result.document.pictures)  # Returns empty list [] for both images
    print(result.document.texts)     # Returns text only for image with text
    
  4. Convert the same image embedded in a PDF file and observe that picture descriptions are correctly generated.
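The steps above can be condensed into one script. This is a minimal sketch, not the reporter's actual code: `build_converter` and `summarize` are hypothetical helper names, the picture description options are passed in rather than guessed, and reading descriptions from each picture's `annotations` list is an assumption about the docling-core data model that may differ between versions.

```python
from types import SimpleNamespace
from typing import Any


def build_converter(picture_description_options: Any):
    """Build a converter that applies the same options to images and PDFs.

    docling is imported lazily so that `summarize` stays usable without it.
    Only a subset of the options from step 1 is set here.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import (
        DocumentConverter,
        ImageFormatOption,
        PdfFormatOption,
    )

    opts = PdfPipelineOptions()
    opts.do_ocr = True
    opts.generate_picture_images = True
    opts.enable_remote_services = True  # required when the describer is a remote API
    opts.do_picture_description = True
    opts.picture_description_options = picture_description_options

    return DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=opts),
            InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
        }
    )


def summarize(document: Any) -> dict:
    """Collect the fields compared in step 3 from a DoclingDocument-like object."""
    pictures = getattr(document, "pictures", [])
    descriptions = []
    for picture in pictures:
        # Assumption: descriptions are annotation objects carrying a `text` field.
        for annotation in getattr(picture, "annotations", []):
            text = getattr(annotation, "text", None)
            if text:
                descriptions.append(text)
    return {
        "texts": [item.text for item in getattr(document, "texts", [])],
        "picture_count": len(pictures),
        "descriptions": descriptions,
    }


# Stand-in object shaped like the "image WITH text" result reported in this issue.
example = SimpleNamespace(texts=[SimpleNamespace(text="am Puss in Boots")], pictures=[])
print(summarize(example))
```

Running `summarize(build_converter(options).convert(path).document)` once for the PNG and once for the PDF puts the two results side by side.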

Expected behavior: The pictures array should contain picture items with descriptions of the visual content (e.g., "A cat wearing boots and a hat") for all images, regardless of whether text is present and regardless of whether the image is standalone or embedded in a PDF.

Actual behavior:

  • For standalone images with text: Only OCR text extraction occurs (texts=['am Puss in Boots']), no picture description generated (pictures=[])
  • For standalone images without text: No content extracted at all (texts=[], pictures=[])
  • For images in PDF files: Picture descriptions are correctly generated and populate the pictures array
  • The image processing pipeline appears to focus exclusively on text extraction for standalone image formats and does not generate visual content descriptions

Example output for image WITH text:

    schema_name='DoclingDocument'
    version='1.7.0'
    name='pussInBoots'
    texts=[TextItem(..., orig='am Puss in Boots', text='am Puss in Boots', ...)]
    pictures=[]  # Empty! No image description generated

Example output for image WITHOUT text:

    schema_name='DoclingDocument'
    version='1.7.0'
    name='pussInBoots_no_text'
    texts=[]  # Empty, as expected
    pictures=[]  # Empty! Should contain image description

Docling version

Docling version: 2.55.1
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.5.0

Python version

Python 3.13.7

Attachments

images.zip

Logs

schema_name='DoclingDocument' version='1.7.0' name='pussInBoots' origin=DocumentOrigin(mimetype='application/pdf', binary_hash=16517824524666051744, filename='pussInBoots.png', uri=None) furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) groups=[] texts=[TextItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[ProvenanceItem(page_no=1, bbox=BoundingBox(l=158.33333333333334, t=1115.6666666666667, r=824.6666666666666, b=1034.6666666666667, coord_origin=<CoordOrigin.BOTTOMLEFT: 'BOTTOMLEFT'>), charspan=(0, 16))], orig='am Puss in Boots', text='am Puss in Boots', formatting=None, hyperlink=None)] pictures=[] tables=[] key_value_items=[] form_items=[] pages={1: PageItem(size=Size(width=1000.0, height=1250.0), image=None, page_no=1)}

JViktoRArtola, Oct 11 '25 21:10

Docling currently generates picture descriptions for images embedded in PDFs but not for standalone image formats like PNG or JPG because the image conversion pipeline does not invoke the picture description enrichment step, even when do_picture_description is enabled in the pipeline options. The PDF pipeline orchestrates enrichment steps (including picture descriptions) via its pipeline options, but the image pipeline may ignore or skip this step, leading to the empty pictures array for standalone images. This difference in pipeline behavior is reflected in the codebase, where enrichment logic for images is less comprehensive than for PDFs (source).

Recent changes in Docling (see PR #2251) have started to unify enrichment steps across pipelines, introducing a ConvertPipelineOptions class that enables picture description for more formats. However, standalone images still use PDF pipeline options, which may not trigger picture description generation for images unless further configuration or code changes are made.

To enable picture descriptions for standalone images, you may need to adjust the pipeline configuration to use the new ConvertPipelineOptions for InputFormat.IMAGE and ensure the image pipeline invokes the enrichment step for picture descriptions, similar to how it's done for PDFs.

To reply, just mention @dosu.

dosubot[bot], Oct 11 '25 21:10

Hi, any update on this? The issue is still occurring in version 2.61.2. I think dosubot’s previous response was incorrect. Could you clarify whether this is an app limitation or a bug?

JViktoRArtola, Nov 11 '25 23:11

This is an application limitation in Docling 2.62.2, not a bug. The codebase routes standalone images (PNG, JPG) through a pipeline that does not invoke the picture description enrichment step, even when do_picture_description is enabled. Only images embedded in PDFs are processed with full enrichment, including AI-generated descriptions. This difference is intentional and stems from how the image and PDF backends and pipelines are designed: standalone images are not currently enriched with visual descriptions by default, while PDFs are (source).

If you need picture descriptions for standalone images, you would need to implement a manual enrichment step or a custom serializer. There are no recent changes or fixes addressing this in the latest versions, and maintainers describe it as a limitation of the current pipeline and backend support (source).
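The thread does not spell out what a "manual enrichment step" would look like. One possible shape is sketched below under stated assumptions: `describe_when_missing` and `describe_fn` are hypothetical names, `describe_fn` stands for whatever captioning call you already have (for example a request to a vision model), and nothing here is a Docling API.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def describe_when_missing(
    document: Any,
    image_path: Path,
    describe_fn: Callable[[Path], str],
) -> Optional[str]:
    """Return a fallback description when the converter found no pictures.

    `document` only needs a `pictures` attribute. `describe_fn` is supplied by
    the caller and is not part of Docling.
    """
    if getattr(document, "pictures", []):
        # The pipeline detected at least one picture; leave its output alone.
        return None
    return describe_fn(image_path)
```

The returned string can then be stored next to the exported Markdown. It is not written back into the document object in this sketch.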


dosubot[bot], Nov 11 '25 23:11

I was wondering if this release addresses the previously mentioned application limitation.

Looking at v2.62.0, I noticed the new feature:

Add the Image backend (https://github.com/docling-project/docling/issues/2627) (3495b73)

JViktoRArtola, Nov 25 '25 21:11

The new Image backend in v2.62.0 improves raw image handling and multi-frame support, but it does not itself generate picture descriptions for standalone images. Picture description enrichment is handled by the pipeline, not the backend. With the current setup, standalone images (PNG, JPG, etc.) are processed through the same pipeline machinery as PDFs, so if your pipeline options enable do_picture_description and provide valid picture_description_options, descriptions should be generated for images just as for PDFs; there is no code path that skips enrichment for images if configured correctly. However, the backend alone does not guarantee this; actual behavior depends on your pipeline configuration and enrichment model availability. There is limited documentation on this, but the code structure supports this workflow (source, source).

If you are still not seeing picture descriptions for standalone images, double-check your pipeline options and consider using a pipeline like VlmPipeline with remote service options, as some users have reported success with custom configurations (source).


dosubot[bot], Nov 25 '25 21:11

@dolfim-ibm I hope you’re doing well. I wanted to follow up on my initial question and ask if you could confirm whether the issue I raised should be considered an application limitation or a bug. Your clarification would really help me understand how best to proceed.

JViktoRArtola, Nov 25 '25 21:11

@JViktoRArtola the main reason you see this is that "full-page pictures" are mostly classified as background art. The picture description works if the picture is embedded in a natural context where it co-exists with other items in a page layout. Docling models are not trained to detect standalone ("cropped") pictures as pictures; they will simply look for layout inside them. You should be able to verify this behaviour by putting the cat picture somewhere in the middle of a page with a few items around it (such as paragraphs). It has nothing to do with the actual input format (PDF vs image); it just happens that most PDFs are actual page layouts.
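The verification suggested here can be scripted. The sketch below is one way to build such a test page, assuming Pillow is installed; `compute_placement` and `compose_page` are hypothetical helper names, and the page size, margins and filler text are arbitrary choices rather than values the layout model requires.

```python
from typing import Tuple


def compute_placement(
    page_size: Tuple[int, int],
    picture_size: Tuple[int, int],
    max_fraction: float = 0.5,
) -> Tuple[int, int, int, int]:
    """Return (left, top, width, height) for a picture centred on a page.

    The picture is scaled down, keeping its aspect ratio, so that it covers at
    most `max_fraction` of the page width and height. It is never scaled up.
    """
    page_w, page_h = page_size
    pic_w, pic_h = picture_size
    scale = min(max_fraction * page_w / pic_w, max_fraction * page_h / pic_h, 1.0)
    width, height = int(pic_w * scale), int(pic_h * scale)
    return (page_w - width) // 2, (page_h - height) // 2, width, height


def compose_page(picture_path: str, out_path: str, page_size=(1240, 1754)) -> None:
    """Paste the picture into a white page with filler text above and below it.

    The default page size is roughly A4 at 150 dpi.
    """
    from PIL import Image, ImageDraw  # Pillow, imported lazily

    picture = Image.open(picture_path).convert("RGB")
    left, top, width, height = compute_placement(page_size, picture.size)
    page = Image.new("RGB", page_size, "white")
    page.paste(picture.resize((width, height)), (left, top))

    draw = ImageDraw.Draw(page)
    filler = "Placeholder paragraph text surrounding the picture."
    # A few lines above and below the picture so the page resembles a layout.
    for i in range(4):
        draw.text((80, 80 + 30 * i), filler, fill="black")
        draw.text((80, top + height + 60 + 30 * i), filler, fill="black")
    page.save(out_path)
```

Converting the composed page with the same pipeline options then shows whether the region is detected as a picture. Whether the default font renders text large enough for the layout model is not guaranteed.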

cau-git, Dec 01 '25 08:12

@cau-git @dolfim-ibm

I suspect there may be an issue with how the application handles InputFormat.IMAGE in relation to the picture-description pipeline. The behavior appears inconsistent: for example, file_example_JPG_1MB.jpg produced no description, while image.png generated a valid description. This inconsistency makes it difficult to predict or control when visual descriptions will be produced. Could you clarify whether ImageFormatOption is expected to consistently honor the picture-description settings, and confirm whether this behavior is a known limitation or should be considered a defect?

file_example_JPG_1MB.JPG

    2025-12-05 17:01:55,014 - WARNING - RapidOCR returned empty result!

version=DoclingVersion(docling_version='2.64.0', docling_core_version='2.54.0', docling_ibm_models_version='3.10.3', docling_parse_version='4.7.2', platform_str='Windows-11-10.0.26100-SP0', py_impl_version='cpython-313', py_lang_version='3.13.9') timestamp=None status=<ConversionStatus.SUCCESS: 'success'> errors=[] pages=[Page(page_no=0, size=Size(width=3800.0, height=2534.0), predictions=PagePredictions(layout=LayoutPrediction(clusters=[]), tablestructure=TableStructurePrediction(table_map={}), figures_classification=None, equations_prediction=None, vlm_response=None), assembled=AssembledUnit(elements=[], body=[], headers=[]), parsed_page=None)] timings={} confidence=ConfidenceReport(parse_score=nan, layout_score=nan, table_score=nan, ocr_score=nan, pages=defaultdict(<class 'docling.datamodel.base_models.PageConfidenceScores'>, {0: PageConfidenceScores(parse_score=nan, layout_score=nan, table_score=nan, ocr_score=nan, mean_grade=<QualityGrade.UNSPECIFIED: 'unspecified'>, low_grade=<QualityGrade.UNSPECIFIED: 'unspecified'>, mean_score=nan, low_score=nan)}), mean_grade=<QualityGrade.UNSPECIFIED: 'unspecified'>, low_grade=<QualityGrade.UNSPECIFIED: 'unspecified'>, mean_score=nan, low_score=nan) document=DoclingDocument(schema_name='DoclingDocument', version='1.8.0', name='file_example_JPG_1MB', origin=DocumentOrigin(mimetype='application/pdf', binary_hash=15072439913624609234, filename='file_example_JPG_1MB.jpg', uri=None), furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, meta=None, name='root', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), body=GroupItem(self_ref='#/body', parent=None, children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, name='root', label=<GroupLabel.UNSPECIFIED: 'unspecified'>), groups=[], texts=[], pictures=[], tables=[], key_value_items=[], form_items=[], pages={1: PageItem(size=Size(width=3800.0, height=2534.0), image=None, page_no=1)}) 
input=InputDocument(file=PureWindowsPath('file_example_JPG_1MB.jpg'), document_hash='683a8528125ca09d8314435c051331de2b4c981c756721a2d12c103e8603a1d2', valid=True, backend_options=None, limits=DocumentLimits(max_num_pages=9223372036854775807, max_file_size=9223372036854775807, page_range=(1, 9223372036854775807)), format=<InputFormat.IMAGE: 'image'>, filesize=1042592, page_count=1) assembled=AssembledUnit(elements=[], body=[], headers=[])
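The dump above shows `layout=LayoutPrediction(clusters=[])`, meaning the layout model detected no regions on the page. A small diagnostic can make this visible per file. It is a sketch that relies only on the attribute names visible in the dump (`pages`, `predictions`, `layout`, `clusters`, `page_no`); the `label` attribute on each cluster and its `'picture'` value are assumptions, and both helper names are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict


def layout_labels_per_page(result: Any) -> Dict[int, Counter]:
    """Count layout cluster labels per page of a ConversionResult-like object."""
    summary: Dict[int, Counter] = {}
    for page in getattr(result, "pages", []):
        layout = getattr(page.predictions, "layout", None)
        clusters = layout.clusters if layout is not None else []
        labels: Counter = Counter()
        for cluster in clusters:
            label = cluster.label
            # Labels may be enum members or plain strings; normalise to str.
            labels[str(getattr(label, "value", label))] += 1
        summary[page.page_no] = labels
    return summary


def has_picture_cluster(result: Any) -> bool:
    """True if any page has at least one cluster labelled 'picture'."""
    return any(
        counts.get("picture", 0) > 0
        for counts in layout_labels_per_page(result).values()
    )


# Stand-in shaped like the dump above: one page, no layout clusters.
empty_page = SimpleNamespace(
    page_no=0,
    predictions=SimpleNamespace(layout=SimpleNamespace(clusters=[])),
)
print(has_picture_cluster(SimpleNamespace(pages=[empty_page])))
```

If this returns False for a file, there is no detected picture region for the picture description step to work on, which would be consistent with the empty `pictures` list.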

Attachments

images (2).zip

JViktoRArtola, Dec 05 '25 22:12