Picture Description in Output

Open rhlarora84 opened this issue 9 months ago • 3 comments

Question

I am using PictureDescriptionApiOptions to generate descriptions of images. I want to replace the image placeholders in the exported output (export_to_) with the picture description. Currently only three ImageRefMode values are available (EMBEDDED, PLACEHOLDER, REFERENCED), and none of them adds the image description to the output.

Is there a way to do this? I see that the descriptions are currently stored as part of the picture annotations.
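
Until the export supports this directly, one possible workaround is to post-process the exported Markdown. The sketch below is only an illustration under stated assumptions: it assumes the export was produced in PLACEHOLDER mode, that the placeholder string is `<!-- image -->`, and that pictures appear in the same order as their placeholders. `Picture` and `fill_placeholders` are hypothetical names, not part of docling; in practice the descriptions would be collected from the picture annotations.

```python
from dataclasses import dataclass, field

# Assumption: the string emitted for each picture in PLACEHOLDER mode.
PLACEHOLDER = "<!-- image -->"


@dataclass
class Picture:
    """Hypothetical stand-in for a picture item and its description annotations."""
    descriptions: list[str] = field(default_factory=list)


def fill_placeholders(markdown: str, pictures: list[Picture],
                      placeholder: str = PLACEHOLDER) -> str:
    """Replace the i-th placeholder with the i-th picture's description.

    Placeholders without a matching description are left untouched.
    """
    parts = markdown.split(placeholder)
    out = [parts[0]]
    for i, tail in enumerate(parts[1:]):
        if i < len(pictures) and pictures[i].descriptions:
            out.append(f"[Image: {' '.join(pictures[i].descriptions)}]")
        else:
            out.append(placeholder)
        out.append(tail)
    return "".join(out)


md = "Intro\n\n<!-- image -->\n\nMiddle\n\n<!-- image -->\n\nEnd"
pics = [Picture(["A bar chart of sales."]), Picture()]
result = fill_placeholders(md, pics)
print(result)
```

The positional matching is the fragile part: it only holds if every picture yields exactly one placeholder and the order is preserved.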

rhlarora84 avatar Feb 16 '25 15:02 rhlarora84

We are planning to address this with custom serializers for picture items, since some use cases need the description, others the text produced by OCR, others the graph data, etc. Some initial work on this should come in the next few days.

dolfim-ibm avatar Feb 17 '25 07:02 dolfim-ibm

@dolfim-ibm If I understand correctly, that means we can combine multiple elements in the output for picture items? For example: description (given by a VLM) + text from OCR + reference + ...?

That could be a very good generic solution!

FloMrt avatar Feb 17 '25 08:02 FloMrt

Yes, that is what we would like to allow. It is clear that each use case will need a different output, so instead of overloading the default output with content, we are thinking of making it modular and customizable.
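
To make the modular idea concrete, here is a minimal sketch of how such a composable picture serializer could look. It is not docling's serializer API, which was still being designed at the time of this comment; `PictureData`, the `*_part` functions, and `make_serializer` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PictureData:
    """Hypothetical container for what is known about one picture."""
    description: Optional[str] = None  # e.g. produced by a VLM
    ocr_text: Optional[str] = None     # e.g. produced by OCR
    uri: Optional[str] = None          # e.g. path of the saved image


# A "part" renders one aspect of a picture, or returns None if it has nothing to say.
Part = Callable[[PictureData], Optional[str]]


def description_part(p: PictureData) -> Optional[str]:
    return f"Description: {p.description}" if p.description else None


def ocr_part(p: PictureData) -> Optional[str]:
    return f"OCR text: {p.ocr_text}" if p.ocr_text else None


def reference_part(p: PictureData) -> Optional[str]:
    return f"![image]({p.uri})" if p.uri else None


def make_serializer(*parts: Part,
                    fallback: str = "<!-- image -->") -> Callable[[PictureData], str]:
    """Combine the chosen parts, in order, into a single picture serializer."""
    def serialize(p: PictureData) -> str:
        rendered = []
        for part in parts:
            text = part(p)
            if text is not None:
                rendered.append(text)
        # Fall back to a placeholder when no part produced output.
        return "\n\n".join(rendered) if rendered else fallback
    return serialize


pic = PictureData(description="A pie chart.", ocr_text="Q1 40%", uri="img/p1.png")
describe_and_ocr = make_serializer(description_part, ocr_part)
print(describe_and_ocr(pic))
```

Each use case then picks its own combination, for example `make_serializer(reference_part, description_part)` for a reference followed by the description, without changing the parts themselves.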

dolfim-ibm avatar Feb 17 '25 09:02 dolfim-ibm

@dolfim-ibm It would be nice if something similar were available for tables, with modes such as referenced (a picture of the table is saved), embedded (the picture is embedded), extracted (text pulled via OCR), and placeholder (??). Allowing either OCR or a VLM to do the conversion would also be a nice addition.

Justinius avatar Mar 28 '25 03:03 Justinius