Apply export_to_markdown to individual document items
Requested feature
Currently, export_to_markdown is a method of DoclingDocument, i.e. I can convert either the entire document or individual pages (using the page_no argument) to markdown.
My application needs to handle DoclingDocument items individually (and convert each of them to markdown). This is currently not possible, so I would have to reimplement a lot of the markdown conversion logic encapsulated in export_to_markdown.
Would it be possible to have an export_to_markdown method that can be used at the item level? For example:
elements = []
for item, _level in document.iterate_items():
    elements.append(item.export_to_markdown(*args, **kwargs))
That way, it is possible to benefit from the markdown conversion logic you have already implemented.
@simonschoe We have that partially established through:
- PictureItem.export_to_markdown
- TableItem.export_to_markdown
We could extend it to other item types, but most of the others need no special treatment, since they are represented as text. Maybe for headings and code blocks one could consider factoring out a similar method as for tables and pictures. Do you want to make a PR proposal for this?
Two thoughts/questions on this:
- Is TableItem.export_to_markdown identical to item.export_to_dataframe().to_markdown?
- I would argue that the other item types (or rather label types) do need special treatment. For example, I am currently approaching the problem as shown below. It would be super convenient to have an export_to_markdown method for each item type, so that it adds the proper markdown syntax for each type while also accounting for the document hierarchy/levels (especially for headings and/or list_items).
for i, (item, _level) in enumerate(document.iterate_items()):
    label = item.label.name
    reformatted_item = {
        "item_id": 1 + i,
        "label": label,
    }
    if label == "PICTURE":
        reformatted_item["text"] = "<--- image --->"
    elif label in ["TABLE", "DOCUMENT_INDEX"]:
        reformatted_item["text"] = item.export_to_dataframe().to_markdown()
    elif label == "TITLE":
        reformatted_item["text"] = "# " + item.text
    elif label == "SECTION_HEADER":
        reformatted_item["text"] = "## " + item.text
    elif label == "FOOTNOTE":
        reformatted_item["text"] = "^ " + item.text
    elif label == "CODE":
        reformatted_item["text"] = (
            f"```{item.code_language if item.code_language != 'unknown' else ''}\n"
            f"{item.text}\n"
            "```"
        )
    elif label == "FORMULA":
        reformatted_item["text"] = ("$$" + item.text + "$$") if item.text != "" else item.orig
    elif label in ["TEXT", "PARAGRAPH", "CAPTION", "LIST_ITEM"]:
        reformatted_item["text"] = item.text
    elif label == "CHECKBOX_SELECTED":
        reformatted_item["text"] = "- [x] " + item.text
    elif label == "CHECKBOX_UNSELECTED":
        reformatted_item["text"] = "- [ ] " + item.text
    else:
        ...
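To illustrate the "accounting for the document hierarchy/levels" part of the request, here is a minimal, self-contained sketch. The helper name item_to_markdown and its level parameter are hypothetical and not part of docling-core; the sketch assumes the caller already knows the label name, the text, and a 1-based depth for the item.

```python
from typing import Optional


def item_to_markdown(label: str, text: str, level: int = 1) -> Optional[str]:
    """Hypothetical helper (not a docling-core API): render one item as markdown.

    `level` is an assumed 1-based depth supplied by the caller.
    """
    if label == "TITLE":
        return f"# {text}"
    if label == "SECTION_HEADER":
        # Offset by one so section headers sit below the title; markdown stops at h6.
        return "#" * min(level + 1, 6) + f" {text}"
    if label == "LIST_ITEM":
        # Two spaces of indentation per nesting level below the first.
        return "  " * max(level - 1, 0) + f"- {text}"
    if label in {"TEXT", "PARAGRAPH", "CAPTION"}:
        return text
    return None  # labels not covered here are left to the caller


print(item_to_markdown("SECTION_HEADER", "Intro", level=2))
print(item_to_markdown("LIST_ITEM", "nested entry", level=2))
```

Returning None for unhandled labels keeps the sketch honest about what it covers; tables, pictures, code, and formulas would still need their own handling as in the snippet above.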
@simonschoe With the new INLINE group, we have indeed partially solved this. We will also gradually extend to_markdown to all DocItems.
@PeterStaar-IBM will the new MarkdownTextSerializer (https://github.com/docling-project/docling-core/blob/b91e6c79b8bc72a6058f31c27bed5a0a60e40bf0/docling_core/experimental/serializer/markdown.py#L66-L120) soon be available as part of export_to_markdown functions for document items other than images or tables? Super looking forward to the feature!
Hi @simonschoe
[!IMPORTANT]
The new Serialization API is currently in "beta" under docling_core.experimental.serializer, i.e.
- the import package WILL change
- the API MAY still slightly change
With that in mind, with the new Serialization API you can indeed instantiate the DocSerializer of your preference, e.g. the MarkdownDocSerializer, and call serialize(), passing the specific item you want to serialize (here).
This snippet in the context of the current image-targeted export_markdown() implementation shows a usage example.
Does this answer your question?
@vagenas Exactly what I was looking for, thanks for the pointer! Minor comment: not sure why, but the serializer serializes & to &amp;.
@simonschoe Markdown can contain HTML, so the Markdown serializer does HTML escaping. FYI Markdown renderers do render &amp; as & (e.g. right here: &).
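For readers who want to see the escaping behaviour in isolation, here is a small sketch using only Python's standard html module. It illustrates HTML escaping in general and is not a claim about how the docling serializer implements it; undoing the escaping with html.unescape is one possible post-processing step if raw ampersands are needed downstream.

```python
import html

raw = "R&D <draft>"

# Escape the characters that are special in HTML (&, <, >); quotes are left alone.
escaped = html.escape(raw, quote=False)
print(escaped)  # R&amp;D &lt;draft&gt;

# Reverse the escaping to get the original text back.
restored = html.unescape(escaped)
print(restored)  # R&D <draft>
```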
Closing as meanwhile released with docling-core v2.29.0.
@vagenas thanks for coming back to this issue! Maybe one final question: can you explain how the INLINE group that @PeterStaar-IBM referred to above improves the serialisation process? I can't figure this out from the source code, unfortunately...
@simonschoe the "inline" is a type of group in the DoclingDocument that enables us to mix together different types of items (e.g. a bold text, a normal text, a code snippet etc.) and still have them appear (in Markdown, HTML etc) as a coherent piece of text/"paragraph", i.e. without new lines between.
So this inline group can be serialized like this.
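As a rough illustration of the idea (not the docling-core implementation), the sketch below contrasts joining differently formatted fragments inline with joining them as separate blocks. The Fragment class and both serialize_* functions are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical stand-in for a text item carrying simple formatting flags."""
    text: str
    bold: bool = False
    code: bool = False


def render(fragment: Fragment) -> str:
    # Wrap the text in the markdown syntax matching its formatting.
    if fragment.code:
        return f"`{fragment.text}`"
    if fragment.bold:
        return f"**{fragment.text}**"
    return fragment.text


def serialize_inline(fragments: List[Fragment]) -> str:
    # Inline group: fragments stay on one line and read as a single paragraph.
    return " ".join(render(f) for f in fragments)


def serialize_blocks(fragments: List[Fragment]) -> str:
    # Without an inline group: every fragment becomes its own paragraph.
    return "\n\n".join(render(f) for f in fragments)


parts = [Fragment("Call"), Fragment("serialize()", code=True), Fragment("carefully", bold=True)]
print(serialize_inline(parts))
print(serialize_blocks(parts))
```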
@vagenas got it, thanks for pointing me to the test!
Is this type already in use as part of the PDF pipeline? With the current docling version, I still observe that the formatting attribute is mostly (exclusively?) empty.