Failed to convert text that appears in automatic numbering format in .docx
Hi,
When trying to convert this .docx document
https://www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_120b/Docs//R1-2501739.zip
to .md, Docling fails to convert the words that use automatic numbering: Proposals 1, 2, 3, 4 and Observations 1, 2, 3. The rest of the text converts properly.
Original .docx:
Output .md
Any ideas on how to handle these automatically numbered keywords in .docx documents?
Environment
Python 3.10.11, Docling 2.32.1
Hello @alexshmmy, I checked your document and I am still a bit unsure whether these are macros, because in theory .docx files cannot contain macros.
My best guess is that the observation and proposal text sits in some kind of text box that the Docling code does not parse. Can you try SmolDocling once to see if it makes any difference? I think you are using one of the OCR engines right now.
Here is a thread supporting my claim that macros cannot be in a .docx: https://answers.microsoft.com/en-us/msoffice/forum/all/i-can-use-macros-in-docx-so-the-definition-of-docm/92ed9d8f-bf77-4ba5-baad-032f5546eba5
@ShiroYasha18, I tried with SmolDocling and the result is the same. I do not use any OCR, just the default conversion, docling input.docx (docling --no-ocr input.docx did not work either). Those keywords are lost during conversion!
Indeed, it seems that even though they look like macros, they are actually automatically numbered words. I have now changed the title to reflect the error better. The point is that they do not convert to Markdown properly with either Docling or SmolDocling, and they appear a lot in standardisation documents.
The only approach that works is to convert to .pdf with MS Office or LibreOffice and then run Docling, but this is not scalable: there are millions of these documents, and we are looking for a scalable solution in which those keywords are kept during direct conversion with Docling.
Any other ideas would be appreciated :)
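For anyone who wants to reproduce the PDF detour described above, here is a minimal sketch in Python. It assumes LibreOffice is installed and reachable as `soffice` (on some systems the binary is called `libreoffice`) and that Docling's Python API is available; the helper names `soffice_command` and `docx_to_markdown_via_pdf` are made up for illustration. It only automates the workaround and does not address the scalability concern.

```python
import subprocess
from pathlib import Path


def soffice_command(docx_path, out_dir):
    """Build the LibreOffice command that renders one .docx to PDF.

    Assumption: the LibreOffice binary is on PATH under the name "soffice".
    """
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def docx_to_markdown_via_pdf(docx_path, out_dir):
    """Render to PDF with LibreOffice, then convert the PDF with Docling."""
    # Imported lazily so the command builder works without Docling installed.
    from docling.document_converter import DocumentConverter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # LibreOffice writes <stem>.pdf into out_dir.
    subprocess.run(soffice_command(docx_path, out_dir), check=True)
    pdf_path = out_dir / (Path(docx_path).stem + ".pdf")
    result = DocumentConverter().convert(str(pdf_path))
    return result.document.export_to_markdown()
```

Calling `docx_to_markdown_via_pdf("input.docx", "pdf_out")` would return the Markdown as a string. Each document costs one LibreOffice launch plus a full PDF pipeline run, which is why this does not scale well to millions of files.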
Oh, I thought earlier that this was due to text box parsing, since I think Docling cannot parse text in elements such as text boxes. That has to be fixed with another XML patch, but I think that is a story for later. Thank you for identifying that these are automatically numbered words, @alexshmmy. If that is the case, they most likely will not be parsed from the .docx format, as your experiments already show.
I think the only fix is to preprocess the .docx to flatten the automatic numbering, so that the generated words, which appear a lot in standardisation docs, are written into the actual text, and then run the preprocessed .docx through the converter. After that, .docx conversion should work fine with the current Docling pipeline. The catch is that integrating a preprocessing feature for .docx would need permission from the Docling team, since it adds a step to the docling-core repo and modifies the flow a bit.
As you mentioned, using SmolDocling did not change things either, so clearly we first need to flatten everything to normal text, and then we can proceed as usual.
Let me know if we can open a PR for this :)
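The flattening step proposed above might look roughly like the sketch below. It is not Docling code, and it rests on an assumption that has not been verified against the linked document: that labels such as "Proposal 1:" come from list numbering definitions in word/numbering.xml, referenced from each paragraph through w:numPr. It uses only the Python standard library, handles decimal counters only, and ignores numbering inherited from paragraph styles, level overrides, and SEQ fields. The function names are hypothetical.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(elem, path):
    """Return the w:val attribute of the first match for path, or None."""
    node = elem.find(path, NS)
    return node.get(f"{{{W}}}val") if node is not None else None


def load_level_formats(numbering_xml):
    """Map (numId, ilvl) -> (lvlText, start) from word/numbering.xml."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for a in root.findall("w:abstractNum", NS):
        aid = a.get(f"{{{W}}}abstractNumId")
        for lvl in a.findall("w:lvl", NS):
            ilvl = int(lvl.get(f"{{{W}}}ilvl"))
            text = _val(lvl, "w:lvlText") or ""
            start = int(_val(lvl, "w:start") or 1)
            abstract[(aid, ilvl)] = (text, start)
    formats = {}
    for num in root.findall("w:num", NS):
        num_id = num.get(f"{{{W}}}numId")
        aid = _val(num, "w:abstractNumId")
        for (a, ilvl), fmt in abstract.items():
            if a == aid:
                formats[(num_id, ilvl)] = fmt
    return formats


def flatten_numbering(docx_bytes):
    """Return paragraph texts with list labels prepended as plain text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        has_numbering = "word/numbering.xml" in zf.namelist()
        numbering = zf.read("word/numbering.xml") if has_numbering else None
    formats = load_level_formats(numbering) if numbering else {}

    counters = {}  # numId -> {ilvl: current counter value}
    lines = []
    for p in document.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id = _val(p, "w:pPr/w:numPr/w:numId")
        ilvl = int(_val(p, "w:pPr/w:numPr/w:ilvl") or 0)
        fmt = formats.get((num_id, ilvl))
        if fmt is None:
            lines.append(text)  # not a numbered paragraph we understand
            continue
        lvl_text, start = fmt
        levels = counters.setdefault(num_id, {})
        levels[ilvl] = levels.get(ilvl, start - 1) + 1
        for deeper in [k for k in levels if k > ilvl]:
            del levels[deeper]  # restart nested levels
        # Replace %1, %2, ... with the counters of levels 0, 1, ...
        label = re.sub(r"%(\d)",
                       lambda m: str(levels.get(int(m.group(1)) - 1, 1)),
                       lvl_text)
        lines.append(f"{label} {text}".strip())
    return lines
```

This version returns the flattened paragraphs as a list of strings rather than writing a new .docx, which keeps the sketch short. A real preprocessing step would instead insert the computed label as a text run and remove the w:numPr element, so the existing .docx backend sees ordinary text.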
Thank you @ShiroYasha18! It would be great if we can address this; there are millions of standardisation documents with this automatic numbering issue.
@maxmnemonic @PeterStaar-IBM Would it be possible to help @ShiroYasha18 submit a PR to the docling-core repo? Thank you very much in advance.
Hi, any news on this? This is a very important feature and it is still unresolved!