
Support non-standard headings for Word

Open Manuel030 opened this issue 1 year ago • 3 comments

Requested feature

For documents created in a non-English version of Word, the heading style names differ from Heading. For example, the German default is Überschrift 1. I understand that it is not feasible to support every localized version of Word. Hence, it would make sense to allow users to pass a config to the MsWordDocumentBackend.
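To illustrate the kind of config I have in mind, here is a minimal sketch. It is not docling's API: `HEADING_PREFIXES` and `heading_level` are hypothetical names, and the prefix list is only an example that a user would extend for their own localization.

```python
import re
from typing import Optional, Sequence

# Hypothetical user-supplied config: style-name prefixes that denote headings.
# "Überschrift" is the German default mentioned above; extend as needed.
HEADING_PREFIXES = ("Heading", "Überschrift")


def heading_level(style_name: str,
                  prefixes: Sequence[str] = HEADING_PREFIXES) -> Optional[int]:
    """Return the heading level encoded in a style name, or None if it is not a heading."""
    name = style_name.strip()
    for prefix in prefixes:
        # Match "<prefix> <number>", tolerating missing whitespace and case differences.
        match = re.fullmatch(rf"{re.escape(prefix)}\s*(\d+)", name, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None
```

With such a mapping, a style named Überschrift 1 would be treated the same as Heading 1, and any style that matches no configured prefix would fall through as regular text.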

Manuel030 avatar Dec 02 '24 15:12 Manuel030

Hi @Manuel030, yes, this is indeed a problem with MS Office formats we are aware of. Let us have an iteration on this topic to see if we can find a scalable solution where users are not required to provide extra configuration.

cau-git avatar Dec 03 '24 15:12 cau-git

@Manuel030, I'm looking for a language-agnostic solution. Any chance you could share with us an example document with a heading created in MS Word with German localization (i.e. with the Überschrift 1 style instead of Heading 1)? It would help to debug such cases, thx!

maxmnemonic avatar Dec 06 '24 09:12 maxmnemonic

@Manuel030, @cau-git, I rewired the label detection logic to use style_id instead of the style name. This should make it agnostic to MS Word localization: https://github.com/DS4SD/docling/pull/534

maxmnemonic avatar Dec 06 '24 13:12 maxmnemonic

Hello, my Italian documents lose their headings and bold formatting when I export them from docx to Markdown. The same documents in PDF format were converted correctly. Why is this happening, and how can I fix it?

Thank you

acirasa avatar Mar 17 '25 19:03 acirasa