docling Apply export_to_markdown to individual document items

Requested feature

Currently, export_to_markdown is a method of DoclingDocument. I.e. I can either convert the entire doc to markdown or individual pages (using the page_no argument).

My application needs to handle DoclingDocument items individually (and convert them to markdown). This is currently not possible, so I would have to reimplement a lot of the markdown conversion logic encapsulated in export_to_markdown.

Would it be possible to have an export_to_markdown method that can be used on the item-level? For example:

elements = []
for item, _level in document.iterate_items():
    elements.append(item.export_to_markdown(*args, **kwargs))

That way, it is possible to benefit from the markdown conversion logic you have already implemented.

Feb 14 '25 08:02 simonschoe

@simonschoe We have that partially established through:

PictureItem.export_to_markdown
TableItem.export_to_markdown

We could extend it to other item types, but most of the others need no special treatment, since they are represented as text. Maybe for headings and code blocks one could consider factoring out a similar method as for tables and pictures. Do you want to make a PR proposal for this?

Feb 14 '25 09:02 cau-git

Two thoughts/questions on this:

Is TableItem.export_to_markdown identical to item.export_to_dataframe().to_markdown?
I would argue that the other item types (or rather label types) do need special treatment. For example, I am currently approaching the problem as shown below. It would be very convenient to have an export_to_markdown method for each item type that adds the proper markdown syntax while also accounting for the document hierarchy/levels (especially for headings and list items).

for i, (item, _level) in enumerate(document.iterate_items()):
    label = item.label.name
    reformatted_item = {
        "item_id": 1 + i,
        "label": label,
    }
    if label == "PICTURE":
        reformatted_item["text"] = "<--- image --->"
    elif label in ["TABLE", "DOCUMENT_INDEX"]:
        reformatted_item["text"] = item.export_to_dataframe().to_markdown()
    elif label == "TITLE":
        reformatted_item["text"] = "# " + item.text
    elif label == "SECTION_HEADER":
        reformatted_item["text"] = "## " + item.text
    elif label == "FOOTNOTE":
        reformatted_item["text"] = "^ " + item.text
    elif label == "CODE":
        reformatted_item["text"] = (
            f"```{item.code_language if item.code_language != 'unknown' else ''}\n"
            f"{item.text}\n"
            "```"
        )
    elif label == "FORMULA":
        reformatted_item["text"] = ("$$" + item.text + "$$") if item.text != "" else item.orig
    elif label in ["TEXT", "PARAGRAPH", "CAPTION", "LIST_ITEM"]:
        reformatted_item["text"] = item.text
    elif label == "CHECKBOX_SELECTED":
        reformatted_item["text"] = "- [x] " + item.text
    elif label == "CHECKBOX_UNSELECTED":
        reformatted_item["text"] = "- [ ] " + item.text
    else:
        ...
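
For illustration only, the level-aware behaviour asked for above could be sketched as a dispatch table rather than an if/elif chain. Everything here is hypothetical: Item is a stand-in dataclass (not a docling class), and the meaning of level (1 = top level) is an assumption, not something docling guarantees.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Item:
    """Hypothetical stand-in for a document item; not a docling class."""
    label: str
    text: str


# One formatter per label; each receives the item and its (assumed) nesting level.
FORMATTERS: Dict[str, Callable[[Item, int], str]] = {
    "TITLE": lambda item, level: "# " + item.text,
    # Deeper section headers get more '#' characters, capped at six.
    "SECTION_HEADER": lambda item, level: "#" * min(level + 1, 6) + " " + item.text,
    # Nested list items are indented by two spaces per level below the first.
    "LIST_ITEM": lambda item, level: "  " * max(level - 1, 0) + "- " + item.text,
    "CHECKBOX_SELECTED": lambda item, level: "- [x] " + item.text,
    "CHECKBOX_UNSELECTED": lambda item, level: "- [ ] " + item.text,
}


def to_markdown(item: Item, level: int = 1) -> str:
    """Format one item, falling back to its plain text for unknown labels."""
    formatter = FORMATTERS.get(item.label)
    return formatter(item, level) if formatter else item.text
```

A new label then only needs one more entry in FORMATTERS, and labels without an entry fall through to plain text.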

Feb 19 '25 13:02 simonschoe

@simonschoe With the new INLINE group, we have partially solved this indeed. We will also extend the to_markdown to all DocItems gradually.

Feb 28 '25 11:02 PeterStaar-IBM

@PeterStaar-IBM will the new MarkdownTextSerializer (https://github.com/docling-project/docling-core/blob/b91e6c79b8bc72a6058f31c27bed5a0a60e40bf0/docling_core/experimental/serializer/markdown.py#L66-L120) soon be available as part of export_to_markdown functions for document items other than images or tables? Super looking forward to the feature!

Mar 30 '25 07:03 simonschoe

Hi @simonschoe

[!IMPORTANT]
The new Serialization API is currently in "beta" under docling_core.experimental.serializer, i.e.

the import package WILL change

the API MAY still slightly change

With that in mind, with the new Serialization API you can indeed instantiate the DocSerializer of your preference, e.g. the MarkdownDocSerializer, and call serialize(), passing the specific item you want to serialize (here).

This snippet in the context of the current image-targeted export_markdown() implementation shows a usage example.

Does this answer your question?

Apr 02 '25 18:04 vagenas

@vagenas Exactly what I was looking for, thanks for the pointer! Minor comment: not sure why, but the serializer serializes & to &amp;.

Apr 03 '25 07:04 simonschoe

Minor comment: not sure why, but the serializer serializes & to &amp;.

@simonschoe Markdown can contain HTML, so the Markdown serializer does HTML escaping. FYI Markdown renderers do render &amp; as & (e.g. right here: &).
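
As a rough illustration of that escaping round trip, here is a minimal sketch using Python's standard html module. Whether the serializer calls this exact function internally is an assumption; the sample string is made up.

```python
import html

raw = "R&D costs < 5%"

# Escape the characters that are special in HTML (&, <, >); quotes are left alone.
escaped = html.escape(raw, quote=False)
print(escaped)  # R&amp;D costs &lt; 5%

# If the plain text is needed downstream, the escaping can be reversed.
restored = html.unescape(escaped)
print(restored == raw)  # True
```

So when the serialized string is consumed as plain text rather than rendered, unescaping it afterwards recovers the original characters.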

Apr 03 '25 07:04 vagenas

Closing, as this has meanwhile been released with docling-core v2.29.0.

May 21 '25 14:05 vagenas

@vagenas thanks for coming back to this issue! Maybe one final question: can you enlighten me on how the INLINE group that @PeterStaar-IBM referred to above improves the serialisation process? I can't figure this out from the source code, unfortunately...

May 21 '25 20:05 simonschoe

@simonschoe the "inline" is a type of group in the DoclingDocument that enables us to mix together different types of items (e.g. a bold text, a normal text, a code snippet etc.) and still have them appear (in Markdown, HTML etc) as a coherent piece of text/"paragraph", i.e. without new lines between.

So this inline group can be serialized like this.
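
To make the idea concrete, here is a toy sketch of the difference. It is not docling's implementation: Fragment, InlineGroup and render are hypothetical stand-ins, and joining inline children with a single space is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Fragment:
    """Hypothetical text fragment with optional formatting flags."""
    text: str
    bold: bool = False
    code: bool = False


@dataclass
class InlineGroup:
    """Hypothetical group whose children belong to one paragraph."""
    children: List[Fragment]


def render_fragment(frag: Fragment) -> str:
    """Wrap a fragment in the Markdown syntax matching its flags."""
    if frag.code:
        return f"`{frag.text}`"
    if frag.bold:
        return f"**{frag.text}**"
    return frag.text


def render(nodes: List[Union[Fragment, InlineGroup]]) -> str:
    """Top-level nodes become separate blocks; inline children share one line."""
    blocks = []
    for node in nodes:
        if isinstance(node, InlineGroup):
            blocks.append(" ".join(render_fragment(c) for c in node.children))
        else:
            blocks.append(render_fragment(node))
    return "\n\n".join(blocks)


parts = [Fragment("Call"), Fragment("serialize()", code=True), Fragment("now", bold=True)]
separate = render(parts)              # three blocks separated by blank lines
grouped = render([InlineGroup(parts)])  # one coherent line of text
```

Without the group, each fragment ends up as its own paragraph; with it, the same fragments read as one sentence.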

May 22 '25 11:05 vagenas

@vagenas got it, thanks for pointing me to the test!

Is this type already in use as part of the PDF pipeline? With the current docling version, I still observe that the formatting attribute is mostly (exclusively?) empty.

May 22 '25 18:05 simonschoe