File format not allowed: file.docx
Bug
Using the docling DocumentConverter on a .docx file, as in the simple conversion example, raises docling.exceptions.ConversionError: File format not allowed: file.docx, although the docling documentation lists docx as a supported format. The files were created in a SharePoint drive through the web interface. Expected behaviour: the script runs without errors.
Steps to reproduce
Here is some information about the file, followed by the repro script:
➜ file file.docx
file.docx: Microsoft Word 2007+
➜ file --mime-type -b file.docx
application/vnd.openxmlformats-officedocument.wordprocessingml.document
from docling.document_converter import DocumentConverter
source = "file.docx"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Docling version
Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0
Python version
Python 3.12.7
@SebastianCallh This could be related to the findings summarized in https://github.com/DS4SD/docling/issues/802. We have it on the radar.
@cau-git thank you for confirming! I really appreciate your work on docling. Do you have an estimate you can share on when this might be addressed? I am afraid it is a showstopper for us, and we would really like to use docling.
+1
file xxx.docx reports: Microsoft OOXML
docling.exceptions.ConversionError: File format not allowed: xxx.docx
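For anyone comparing affected files, here is a minimal diagnostic sketch. It is not part of docling and does not fix the error. It only assumes that the problem lies in how the file is recognized, which the thread has not confirmed. The helper name inspect_docx and the report keys are hypothetical. It uses the standard library to check whether the file is a ZIP container, which entries come first, and whether the conventional parts [Content_Types].xml and word/document.xml are present. The entry order is included because some signature-based detectors may be sensitive to it, which could explain why file reports "Microsoft Word 2007+" for one document and "Microsoft OOXML" for another.

```python
import zipfile
from pathlib import Path

# Part names a Word .docx package conventionally contains.
CONTENT_TYPES = "[Content_Types].xml"
MAIN_PART = "word/document.xml"


def inspect_docx(path):
    """Hypothetical helper (not part of docling): summarize a .docx container."""
    path = Path(path)
    report = {
        "suffix": path.suffix.lower(),
        "is_zip": zipfile.is_zipfile(path),
        "first_entries": [],
        "has_content_types": False,
        "has_main_part": False,
    }
    if not report["is_zip"]:
        # Not a ZIP archive at all, so there is nothing more to look at.
        return report
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Entries are listed in the order recorded in the archive's central directory.
    report["first_entries"] = names[:3]
    report["has_content_types"] = CONTENT_TYPES in names
    report["has_main_part"] = MAIN_PART in names
    return report
```

Running print(inspect_docx("file.docx")) on a failing file and on a working one, and posting both results, may help show whether the SharePoint-generated files differ structurally.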