docling
feat: Enable markdown text formatting for docx
Hi,
This PR adds markdown text formatting for docx documents (italic, bold, underline and hyperlinks). I included a new tests/data/docx/unit_test_formatting.docx document to illustrate it. Using the latest docling main the output of export_to_markdown is:
italic bold underline hyperlink italic and bold hyperlink italic bold underline and hyperlink on the same line
with this PR it becomes:
italic bold underline hyperlink italic and bold hyperlink italic bold underline and hyperlink on the same line
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing.
When test data is updated, we require two reviewers
- [ ]
#approved-reviews-by >= 2
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
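To see what the title rule accepts, here is a minimal sketch that applies the same pattern with Python's `re` module. It assumes the `~=` operator behaves like a regular-expression search; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    # Assumption: the rule's `~=` is equivalent to a regex search.
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat: Enable markdown text formatting for docx"))  # True
print(is_conventional("feat(docx)!: propagate formatting"))               # True
print(is_conventional("Enable markdown text formatting for docx"))        # False
```

The optional `(scope)` and `!` parts are why the second title passes; a title without a recognized type prefix fails.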
Note: for underline I used the <u> / </u> tags that are not rendered on GitHub
@maxmnemonic @PeterStaar-IBM do you need any additional info for this PR ?
@SimJeg this is an interesting feature, but we should introduce it with an option for enable/disable, because not all output formats will be compatible with markdown styling. There could also be some consideration on whether to propagate text styling in the Docling document format, but the option will be needed.
Hi @dolfim-ibm,
Indeed, a different function should be applied for HTML, for instance. I can add an argument to the convert function (e.g. style=[None, "markdown", "html"]).
As there are several options to do this and I don't know the docling API very well, I'll wait for your confirmation before pushing updates.
@dolfim-ibm any update on it ?
We actually are considering something similar to what you are proposing.
Adding the option for the format at convert time (with default None) is good, but we would like to have it in the PipelineOptions for the MS Word backend, since it will be something specific to it.
We will soon post more details, but the above is the general idea.
@SimJeg We will implement a design as proposed here: https://github.com/DS4SD/docling/discussions/894 Then, this work will be able to make use of it.
Thanks for the update!
@vagenas Can you have a look here and see how this intersects with our new concept of INLINE groups? I wonder if we need to extend docling-doc with BOLD, ITALIC, UNDERLINE and STRIPED groups and adapt this PR.
FYI: @cau-git @dolfim-ibm
Hi @SimJeg, with https://github.com/docling-project/docling-core/pull/182 we introduced (as beta) a Serialization API operating against the DoclingDocument. This also includes formatting.
This test code shows how the various formatting options can be set.
Can you update your PR so that it sets these formatting options when adding the respective items to the DoclingDocument?
The actual export to the various output formats should not be part of this PR as it will be taken care of by the new Serialization API. For example, the Markdown export is already using the new API and automatically exports bold, italics, strikethrough, and hyperlinks.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X]
#approved-reviews-by >= 2
Hi @SimJeg, where do you stand on the discussed updates?
To provide some more context on our example snippet:
- we use formatting & hyperlink options to specify how individual items are to be formatted
- we create an inline group to indicate that multiple items (that may be differently formatted) should actually be interpreted as parts of a single "inline" component instead of separate "paragraphs" (details here)
Hope that explains this a bit better.
Looking forward to your updates; it would be great to have the DOCX backend updated this week!
@vagenas currently looking at it. I started by merging the current main and noticed that the following code
from docling.document_converter import DocumentConverter
source = "/path/to/docling/tests/data/docx/unit_test_formatting.docx"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
now returns &lt;u&gt; instead of <u> for underline.
My current proposal is mainly to update the handle_text_elements function with 1 line:
+ text = self.format_paragraph(paragraph)
where format_paragraph handles formatting. The resulting text can then be used by 3 different functions depending on the style (self.add_listitem, doc.add_text or self.add_header).
In docx, a paragraph is a list of different runs, each of which can have a different font (bold, italic and underline). This implies that to use the new Formatting option, my self.format_paragraph function should now return a list of tuples (text, format, hyperlink) that should then be handled by the 3 different functions. Is that correct ?
@vagenas I pushed a small update to use the Formatting class instead of my style tuple and to handle each hyperlink in a separate "item". It will make it easier to return a list of (text, format, hyperlink) if this is the direction you want me to follow.
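As a rough illustration of the "list of (text, format, hyperlink)" idea, the sketch below uses stand-in dataclasses instead of the real python-docx and docling types. `Fmt`, `Run` and this `format_paragraph` are hypothetical, and merging adjacent runs with identical formatting is an assumption of the sketch, not something the PR is known to do.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fmt:
    """Hypothetical stand-in for a formatting record (not docling's class)."""
    bold: bool = False
    italic: bool = False
    underline: bool = False


@dataclass
class Run:
    """Hypothetical stand-in for a docx run: text plus its font flags."""
    text: str
    fmt: Fmt = field(default_factory=Fmt)
    hyperlink: Optional[str] = None


Element = Tuple[str, Fmt, Optional[str]]


def format_paragraph(runs: List[Run]) -> List[Element]:
    """Turn runs into (text, format, hyperlink) tuples, merging equal neighbours."""
    elements: List[Element] = []
    for run in runs:
        if not run.text:
            continue  # skip empty runs
        if elements and elements[-1][1:] == (run.fmt, run.hyperlink):
            # Same formatting and link as the previous element: extend it.
            prev_text, fmt, link = elements[-1]
            elements[-1] = (prev_text + run.text, fmt, link)
        else:
            elements.append((run.text, run.fmt, run.hyperlink))
    return elements


runs = [
    Run("ita", Fmt(italic=True)),
    Run("lic", Fmt(italic=True)),
    Run(" and "),
    Run("hyperlink", hyperlink="https://github.com/DS4SD/docling"),
]
elements = format_paragraph(runs)
print(len(elements))   # 3
print(elements[0][0])  # italic
```

Each resulting tuple can then be passed to whichever add function matches the paragraph style.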
@SimJeg
> now returns &lt;u&gt; instead of <u> for underline
The actual Markdown-specific formatting is performed automatically by export_to_markdown() (i.e. the format_text() in this PR is to be removed); all a backend should do is:
- correctly populate each item's formatting and hyperlink fields
- wrap these items within an inline group wherever needed (e.g. if a paragraph contains a mix of different formats)
> my self.format_paragraph function should now return a list of tuple (text, format, hyperlink)

Sounds good, as you'll want to apply the doc.add_* operations not to a single text but to the various potential components of the paragraph (as returned by paragraph.iter_inner_content()), along with their evaluated formatting and hyperlink.
In case your returned list indeed has more than one element (i.e. multiple "runs" in the paragraph), you can then open a new inline group in handle_text_elements() and use that group as the parent when adding direct children (text items, lists, etc.).
@vagenas I removed the format_text method and pushed a version that only handles doc.add_text(label=DocItemLabel.PARAGRAPH, ...) (so it's still wip). Using the same code as above I get:
italic
bold
underline
*italic *
**bold **
underline
and
on the same line
So
- export_to_markdown does not handle underline (which is OK I guess, as there is no standard for it)
- there is an issue with getting everything on 1 line

Could you check my code and tell me what's wrong / if it's going in the right direction ? If it seems OK to you, I guess I'll add a `for text, format, hyperlink in paragraph_elements:` loop everywhere it's needed.
I tried to replace
for text, format, hyperlink in paragraph_elements:
    doc.add_text(
        label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text,
        formatting=format, hyperlink=hyperlink
    )
by
for text, format, hyperlink in paragraph_elements:
    inline_fmt = doc.add_group(label=GroupLabel.INLINE, parent=self.parents[level - 1])
    doc.add_text(
        label=DocItemLabel.TEXT, parent=inline_fmt, text=text,
        formatting=format, hyperlink=hyperlink
    )
but it had no effect
@SimJeg
- underline is indeed deliberately not considered by the Markdown export.
- to get everything on one line, you'll need to create an inline group as mentioned above.
E.g. you could do something like this once you have your paragraph_elements:
parent: NodeItem = self.parents.get(self.get_level() - 1)
if len(paragraph_elements) > 1:
    parent = doc.add_group(
        label=GroupLabel.INLINE, parent=parent,
    )
and then pass that parent in your doc.add_text() invocations (please also strip the text there, as I think most Markdown interpreters only apply formatting on strings with no leading/trailing whitespace).
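The difference between one group per paragraph and one group per element is easier to see on a toy tree. The following is a self-contained sketch with a hypothetical `Node` class standing in for the document model; it does not use docling's API, and `add_paragraph` is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node standing in for a document item or group."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str, text: str = "") -> "Node":
        child = Node(label, text)
        self.children.append(child)
        return child


def add_paragraph(parent: Node, elements: List[str]) -> None:
    """Attach elements to parent, wrapping them in ONE inline group if several."""
    if len(elements) > 1:
        # The group is created once, before the loop, not once per element.
        parent = parent.add("inline")
    for text in elements:
        parent.add("text", text.strip())


body = Node("body")
add_paragraph(body, ["italic ", "bold ", "on the same line"])
add_paragraph(body, ["single run"])

print([child.label for child in body.children])  # ['inline', 'text']
print(len(body.children[0].children))            # 3
```

A paragraph with a single element is attached directly, so no group is created for it.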
@vagenas what's wrong with the code I shared where I (try to) use inline groups ? (we posted almost simultaneously)
For stripping, my (deleted) format_text made sure the leading and trailing whitespaces were preserved because in Word, you can have a text italic bold where the space between "italic" and "bold" can be in italic. If you don't preserve the whitespaces, this would become italicbold without spacing. ~~It seems that export_to_markdown instead inserts \n\n between them by default but that's not correct.~~
@SimJeg
> what's wrong with the code I shared where I (try to) use inline groups ?

Well, you don't want to have a separate inline group for each paragraph element. Instead you want a single inline group for the whole paragraph (in case it comprises more than one element), so the snippet I shared should be used right after getting the paragraph_elements (not inside a paragraph_elements for-loop).

> For stripping, my (deleted) format_text made sure the leading and trailing whitespaces were preserved because in Word, you can have a text italic bold where the space between "italic" and "bold" can be in italic. If you don't preserve the whitespaces, this would become italicbold without spacing. It seems that export_to_markdown instead inserts \n\n between them by default but that's not correct.

The exporter will add a single space between inline elements (not \n\n). Formatted spaces may be technically possible in Word, but they appear problematic in Markdown.
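To make the stripping point concrete: in CommonMark, `*italic *` is not parsed as emphasis because the closing delimiter is preceded by whitespace. The sketch below is not docling's serializer; it is a simplified, hypothetical renderer that strips each element and joins the results with a single space.

```python
from typing import List, Optional, Tuple

# (text, bold, italic, hyperlink): a simplified, hypothetical element shape.
Element = Tuple[str, bool, bool, Optional[str]]


def render_inline(elements: List[Element]) -> str:
    """Render inline elements as Markdown, joined by a single space."""
    parts = []
    for text, bold, italic, link in elements:
        text = text.strip()  # emphasis markers must sit directly next to the text
        if not text:
            continue
        if bold and italic:
            text = f"***{text}***"
        elif bold:
            text = f"**{text}**"
        elif italic:
            text = f"*{text}*"
        if link:
            text = f"[{text}]({link})"
        parts.append(text)
    return " ".join(parts)


line = render_inline([
    ("italic ", False, True, None),
    ("bold ", True, False, None),
    ("underline and", False, False, None),
    ("hyperlink", False, False, "https://github.com/DS4SD/docling"),
    (" on the same line", False, False, None),
])
print(line)
# *italic* **bold** underline and [hyperlink](https://github.com/DS4SD/docling) on the same line
```

The trade-off is the one raised in the thread: a formatted space between two runs is dropped, and the separator is always a plain space.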
> Well, you don't want to have a separate inline group for each paragraph element

I fixed it, thanks; the parent was indeed not on the right side of the for loop.
I will move forward and update all other doc.add_* using the for loop
@vagenas I now handled lists too and updated tests/data/docx/unit_test_formatting.docx to have associated tests. For title, headers and equations, I did not change anything. The output looks good:
*italic*
**bold**
underline
[hyperlink](https://github.com/DS4SD/docling)
[***italic and bold hyperlink***](https://github.com/DS4SD/docling)
*italic* **bold** underline and [hyperlink](https://github.com/DS4SD/docling) on the same line
- *Italic bullet 1*
- **Bold bullet 2**
- Underline bullet 3
Your review is welcome
@vagenas I also added 2 lines for a missing feature: handling headers and footers in MS Word documents (see #632). I added the header of the first section and the footer of the last section, and updated tests/data/docx/unit_test_formatting.docx to include a header and footer.
A better implementation would be to handle all sections properly but the following code did not work (I did not look deeply into walk_linear however).
for section in self.docx_obj.sections:
    doc = self.walk_linear(section.header._element, self.docx_obj, doc)
    for e in section.iter_inner_content():
        doc = self.walk_linear(e._element, self.docx_obj, doc)  # does not add anything
    doc = self.walk_linear(section.footer._element, self.docx_obj, doc)
@vagenas any feedback ? could you run the tests ? Would be great to merge today if possible
Thanks for the valuable input @rateixei; formatting in the special case of equations will be addressed in a follow-up iteration.
Thanks for this nice contribution @SimJeg!