feat: Enable markdown text formatting for docx

Open SimJeg opened this issue 11 months ago • 9 comments

Hi,

This PR adds markdown text formatting for docx documents (italic, bold, underline and hyperlinks). I included a new tests/data/docx/unit_test_formatting.docx document to illustrate it. Using the latest docling main the output of export_to_markdown is:

italic bold underline hyperlink italic and bold hyperlink italic bold underline and hyperlink on the same line

with this PR it becomes:

italic bold underline hyperlink italic and bold hyperlink italic bold underline and hyperlink on the same line

SimJeg avatar Dec 19 '24 11:12 SimJeg

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar Dec 19 '24 11:12 mergify[bot]

Note: for underline I used the <u> / </u> tags that are not rendered on GitHub 😅

SimJeg avatar Dec 19 '24 14:12 SimJeg

@maxmnemonic @PeterStaar-IBM do you need any additional info for this PR ?

SimJeg avatar Dec 26 '24 09:12 SimJeg

@SimJeg this is an interesting feature, but we should introduce it with an option for enable/disable, because not all output formats will be compatible with markdown styling. There could also be some consideration on whether to propagate text styling in the Docling document format, but the option will be needed.

dolfim-ibm avatar Jan 06 '25 08:01 dolfim-ibm

Hi @dolfim-ibm,

Indeed, a different function should be applied for HTML, for instance. I can add an argument to the convert function (e.g. style=[None, "markdown", "html"]).

As there are several options to do this and I don't know the docling API very well, I'll wait for your confirmation before pushing updates.

SimJeg avatar Jan 06 '25 09:01 SimJeg

@dolfim-ibm any update on it ?

SimJeg avatar Jan 13 '25 14:01 SimJeg

We actually are considering something similar to what you are proposing.

Adding the option for the format at convert time (with default None) is good, but we would like to have them in the PipelineOptions for the MS Word backend, since it will be something specific to it.

We will soon post more details, but the above is the general idea.

dolfim-ibm avatar Jan 14 '25 17:01 dolfim-ibm

@SimJeg We will implement a design as proposed here: https://github.com/DS4SD/docling/discussions/894 Then, this work will be able to make use of it.

cau-git avatar Feb 07 '25 15:02 cau-git

Thanks for the update!

SimJeg avatar Feb 07 '25 16:02 SimJeg

@vagenas Can you have a look here and see how this intersects with our new concept of INLINE groups? I wonder if we need to extend docling-doc with BOLD, ITALIC, UNDERLINE and STRIPED groups and adapt this PR.

FYI: @cau-git @dolfim-ibm

PeterStaar-IBM avatar Feb 27 '25 13:02 PeterStaar-IBM

Hi @SimJeg, with https://github.com/docling-project/docling-core/pull/182 we introduced (as beta) a Serialization API operating against the DoclingDocument. This also includes formatting.

This test code shows how the various formatting options can be set.

👉 Can you update your PR so that it sets these formatting options when adding the respective items to the DoclingDocument?

The actual export to the various output formats should not be part of this PR, as it will be taken care of by the new Serialization API; e.g. the Markdown export is already using the new API & automatically exports bold, italics, strikethrough, and hyperlinks.

vagenas avatar Mar 17 '25 16:03 vagenas

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot] avatar Mar 31 '25 08:03 mergify[bot]

Hi @SimJeg, where do you stand on the discussed updates?

To provide some more context on our example snippet:

  • we use formatting & hyperlink options to specify how individual items are to be formatted
  • we create an inline group to indicate that multiple items (that may be differently formatted) should actually be interpreted as parts of a single "inline" component instead of separate "paragraphs" (details here)

Hope that explains this a bit better.

Looking forward to your updates. It would be great to have the DOCX backend updated this week! 🙌

vagenas avatar Mar 31 '25 09:03 vagenas

@vagenas currently looking at it. I started by merging the current main and noticed that the following code

from docling.document_converter import DocumentConverter

source = "/path/to/docling/tests/data/docx/unit_test_formatting.docx"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

now returns &lt;u&gt; instead of <u> for underline.
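
One plausible explanation (an assumption, not something confirmed in this thread) is that the Markdown export now escapes HTML special characters in item text. Whether docling does this with `html.escape` specifically is not known here; the standard-library call below only reproduces the same transformation on a literal tag:

```python
import html

# A raw underline tag as emitted by the backend in this PR.
raw = "<u>underline</u>"

# Escaping turns the angle brackets into HTML entities,
# so the tag is shown literally instead of being rendered.
escaped = html.escape(raw)
print(escaped)
```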

SimJeg avatar Mar 31 '25 09:03 SimJeg

My current proposal is mainly to update the handle_text_elements function with 1 line:

+ text = self.format_paragraph(paragraph)

where format_paragraph handles formatting. The resulting text can then be used by 3 different functions depending on the style (self.add_listitem, doc.add_text or self.add_header).

In docx, a paragraph is a list of runs, each of which can have a different font style (bold, italic and underline). This implies that to use the new Formatting option, my self.format_paragraph function should now return a list of tuples (text, format, hyperlink) that should then be handled by the 3 different functions. Is that correct?
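
A minimal sketch of what such a return value could look like, using plain stand-in classes rather than python-docx or docling-core objects. `Run`, `Formatting` and this `format_paragraph` are hypothetical and only mimic the shape discussed here; the real Formatting class in docling-core may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Formatting:
    # Hypothetical stand-in for the formatting options of a text item.
    bold: bool = False
    italic: bool = False
    underline: bool = False


@dataclass
class Run:
    # Hypothetical stand-in for a docx run: a span of text with one style.
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False
    hyperlink: Optional[str] = None


Element = Tuple[str, Formatting, Optional[str]]


def format_paragraph(runs: List[Run]) -> List[Element]:
    """Merge consecutive runs that share formatting and hyperlink."""
    elements: List[Element] = []
    for run in runs:
        if not run.text:
            continue  # skip empty runs
        fmt = Formatting(run.bold, run.italic, run.underline)
        if elements and elements[-1][1:] == (fmt, run.hyperlink):
            # Same style as the previous element: extend its text.
            prev_text = elements[-1][0]
            elements[-1] = (prev_text + run.text, fmt, run.hyperlink)
        else:
            elements.append((run.text, fmt, run.hyperlink))
    return elements


runs = [
    Run("ita", italic=True),
    Run("lic ", italic=True),
    Run("bold", bold=True),
    Run(" and "),
    Run("hyperlink", hyperlink="https://github.com/DS4SD/docling"),
]
paragraph_elements = format_paragraph(runs)
```

Merging adjacent runs with identical style keeps the list short, since Word often splits visually uniform text into several runs.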

SimJeg avatar Mar 31 '25 09:03 SimJeg

@vagenas I pushed a small update to use the Formatting class instead of my style tuple and to handle each hyperlink in a separate "item". It will make it easier to return a list of (text, format, hyperlink) if this is the direction you want me to follow.

SimJeg avatar Mar 31 '25 10:03 SimJeg

@SimJeg

> now returns &lt;u&gt; instead of <u> for underline

The actual Markdown-specific formatting is performed automatically by export_to_markdown() (i.e. the format_text() in this PR is to be removed 😉). All a backend should do is:

  • correctly populate each item's formatting and hyperlink fields
  • wrap these items within an inline group wherever needed (e.g. if a paragraph contains a mix of different formats)

> my self.format_paragraph function should now return a list of tuples (text, format, hyperlink)

Sounds good, as you'll want to apply the doc.add_* operations not on a single text, but rather on the various potential components of the paragraph (as returned by paragraph.iter_inner_content()), along with their evaluated formatting & hyperlink.

In case your returned list indeed has more than one element (i.e. multiple "runs" in the paragraph), you can then open a new inline group in handle_text_elements() and use that group as the parent when adding direct children (text items, lists etc).

vagenas avatar Mar 31 '25 11:03 vagenas

@vagenas I removed the format_text method and pushed a version that only handles doc.add_text(label=DocItemLabel.PARAGRAPH, ...) (so it's still wip). Using the same code as above I get:

italic

bold

underline

hyperlink

italic and bold hyperlink

*italic *

**bold **

underline

and

hyperlink

on the same line

So

  1. export_to_markdown does not handle underline (which is ok I guess as there is no standard for it)
  2. there is an issue to have everything on 1 line

Could you check my code and tell me what's wrong / if it's going in the right direction ? If it seems ok for you I guess I'll add a few for text, format, hyperlink in paragraph_elements: loop everywhere it's needed

SimJeg avatar Mar 31 '25 12:03 SimJeg

I tried to replace

            for text, format, hyperlink in paragraph_elements:
                doc.add_text(
                    label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text,
                    formatting=format, hyperlink=hyperlink
                )

by

            for text, format, hyperlink in paragraph_elements:
                inline_fmt = doc.add_group(label=GroupLabel.INLINE, parent=self.parents[level - 1])
                doc.add_text(
                    label=DocItemLabel.TEXT, parent=inline_fmt, text=text,
                    formatting=format, hyperlink=hyperlink
                )

but it had no effect

SimJeg avatar Mar 31 '25 13:03 SimJeg

@SimJeg

  1. underline is indeed deliberately not considered by the Markdown export.
  2. to get that you'll need to create an inline group as mentioned above.

E.g. you could do something like this once you have your paragraph_elements:

parent: NodeItem = self.parents.get(self.get_level() - 1)
if len(paragraph_elements) > 1:
    parent = doc.add_group(
        label=GroupLabel.INLINE, parent=parent,
    )

and then pass that parent in your doc.add_text() invocations (please also strip the text there, as I think most Markdown interpreters only apply formatting on strings with no leading/trailing whitespace).
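
To make the intended tree shape concrete, here is a toy sketch that does not use docling at all. `Node` and `add_paragraph` are hypothetical stand-ins for the document items and the backend logic; the point is only that the inline group is created once per paragraph, before iterating over its elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    # Hypothetical stand-in for a document node (group or text item).
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def add_paragraph(parent: Node, elements: List[Tuple[str, Optional[str]]]) -> Node:
    """Attach paragraph elements to parent, using ONE inline group if needed.

    Each element is (text, style); style is a free-form string in this toy.
    Returns the node that the text items were attached to.
    """
    target = parent
    if len(elements) > 1:
        # Create the group once, before the loop, not once per element.
        target = Node(label="inline")
        parent.children.append(target)
    for text, _style in elements:
        # Strip the text, as suggested above.
        target.children.append(Node(label="text", text=text.strip()))
    return target


body = Node(label="body")
add_paragraph(body, [("italic ", "italic"), ("bold ", "bold"), ("underline", None)])
add_paragraph(body, [("plain paragraph", None)])
```

A paragraph with a single element is attached directly to its parent, so no group is created for it.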

vagenas avatar Mar 31 '25 13:03 vagenas

@vagenas what's wrong with the code I shared where I (try to) use inline groups? (we posted almost simultaneously)

For stripping, my (deleted) format_text made sure that leading and trailing whitespace was preserved, because in Word you can have the text italic bold where the space between "italic" and "bold" is itself in italic. If you don't preserve the whitespace, this would become italicbold without spacing. ~~It seems that export_to_markdown instead inserts \n\n between them by default but that's not correct.~~

SimJeg avatar Mar 31 '25 13:03 SimJeg

@SimJeg

> what's wrong with the code I shared where I (try to) use inline groups?

Well, you don't want to have a separate inline group for each paragraph element. Instead you want a single inline group for the whole paragraph (in case it comprises more than one element), so the snippet I shared should be used right after getting the paragraph_elements (not inside a for-loop over paragraph_elements).

> For stripping, my (deleted) format_text made sure that leading and trailing whitespace was preserved, because in Word you can have the text italic bold where the space between "italic" and "bold" is itself in italic. If you don't preserve the whitespace, this would become italicbold without spacing. It seems that export_to_markdown instead inserts \n\n between them by default but that's not correct.

The exporter will add a single space between inline elements (not \n\n). Formatted spaces may be technically possible in Word, but they appear problematic in Markdown.
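
To illustrate the Markdown side: under CommonMark, `*italic *` is not parsed as emphasis because the closing delimiter follows whitespace. One possible workaround, sketched with a hypothetical helper `wrap_emphasis` that is not part of docling, is to keep the whitespace but move it outside the markers:

```python
def wrap_emphasis(text: str, marker: str) -> str:
    """Wrap the non-whitespace core of text in marker, keeping outer spaces."""
    core = text.strip()
    if not core:
        return text  # whitespace-only: nothing to emphasize
    # Split off the leading and trailing whitespace around the core.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{core}{marker}{trail}"


# The italic run carries the trailing space, as it can in Word.
pieces = [wrap_emphasis("italic ", "*"), wrap_emphasis("bold", "**")]
line = "".join(pieces)
print(line)
```

This keeps the spacing between words while leaving the markers adjacent to non-whitespace characters.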

vagenas avatar Mar 31 '25 13:03 vagenas

> Well, you don't want to have a separate inline group for each paragraph element

I fixed it, thanks. The parent was indeed not on the right side of the for loop 😅 I will move forward and update all other doc.add_* calls using the for loop.

SimJeg avatar Mar 31 '25 13:03 SimJeg

@vagenas I now handled lists too and updated tests/data/docx/unit_test_formatting.docx to have associated tests. For title, headers and equations, I did not change anything. The output looks good:

*italic*

**bold**

underline

[hyperlink](https://github.com/DS4SD/docling)

[***italic and bold hyperlink***](https://github.com/DS4SD/docling)

*italic* **bold** underline and [hyperlink](https://github.com/DS4SD/docling) on the same line

- *Italic bullet 1*
- **Bold bullet 2**
- Underline bullet 3

Your review is welcome

SimJeg avatar Mar 31 '25 14:03 SimJeg

@vagenas I also added 2 lines for a missing feature: handling headers and footers in MS Word documents (see #632). I added the header of the first section and the footer of the last section, and updated tests/data/docx/unit_test_formatting.docx to include a header and footer.

A better implementation would be to handle all sections properly but the following code did not work (I did not look deeply into walk_linear however).

            for section in self.docx_obj.sections:
                doc = self.walk_linear(section.header._element, self.docx_obj, doc)
                for e in section.iter_inner_content():
                    doc = self.walk_linear(e._element, self.docx_obj, doc) # does not add anything
                doc = self.walk_linear(section.footer._element, self.docx_obj, doc)

SimJeg avatar Apr 01 '25 09:04 SimJeg

@vagenas any feedback ? could you run the tests ? Would be great to merge today if possible

SimJeg avatar Apr 02 '25 06:04 SimJeg

Thanks for the valuable input @rateixei. Formatting in the special case of equations will be addressed in a follow-up iteration.

vagenas avatar Apr 03 '25 12:04 vagenas

Thanks for this nice contribution @SimJeg! 🙌

vagenas avatar Apr 03 '25 13:04 vagenas