
Convert raising 'xlsx' type is not supported.

Open MatheusAbdias opened this issue 10 months ago • 2 comments

Bug

The _guess_format method in _DocumentConversionInput class is incorrectly identifying XLSX files as "application/zip" format.

Steps to Reproduce

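  1. Create an instance of DocumentConverter with XLSX support:
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter
converter = DocumentConverter(allowed_formats=[InputFormat.XLSX])
result = converter.convert(path)
result.document.save_as_markdown(Path("./output.md"))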

The _guess_format method in _DocumentConversionInput is returning "application/zip"

class _DocumentConversionInput:
    def _guess_format(self, obj: Path | DocumentStream) -> InputFormat | None: ...

Inside _guess_format, the call to filetype.guess_mime returns application/zip. ...
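For context, an XLSX file is a ZIP archive containing XML parts, so a detector that only looks at the leading bytes can plausibly report it as application/zip. The sketch below uses only the standard library to illustrate this. It builds an in-memory archive with XLSX-like entry names (an assumption made for illustration, not a valid workbook) and checks for the ZIP signature. The helper name make_minimal_ooxml_zip is hypothetical, and this does not reproduce what filetype does internally.

```python
import io
import zipfile


def make_minimal_ooxml_zip() -> bytes:
    """Build an in-memory ZIP with XLSX-like entry names.

    This is NOT a valid workbook; it only mimics the container layout.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("xl/workbook.xml", "<workbook/>")
    return buf.getvalue()


data = make_minimal_ooxml_zip()

# The first four bytes are the ZIP local file header signature.
has_zip_signature = data[:4] == b"PK\x03\x04"

# The standard library also recognises the payload as a ZIP archive.
is_zip = zipfile.is_zipfile(io.BytesIO(data))

print(has_zip_signature, is_zip)
```

Both checks succeed for this payload, so the leading bytes alone cannot separate a spreadsheet from any other ZIP archive.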

Docling version

Docling version: 2.24.0 Docling Core version: 2.20.0 Docling IBM Models version: 3.4.0 Docling Parse version: 3.4.0 Python: cpython-312 (3.12.8) Platform: Linux-6.6.75-2-MANJARO-x86_64-with-glibc2.41 ...

Python version

Python 3.12.8 ...

MatheusAbdias avatar Feb 22 '25 15:02 MatheusAbdias

While investigating this issue, I found the python-magic library which correctly identifies the XLSX file type. I've tested it with the same file and it returns the proper mime type. However, this solution requires installing libmagic as a system dependency. Would it be acceptable to add this dependency to the project?

MatheusAbdias avatar Feb 23 '25 23:02 MatheusAbdias

@MatheusAbdias Thanks for your suggestion. This comes back to https://github.com/DS4SD/docling/issues/802, where we track this issue more broadly.

We want to avoid working with libmagic because it is GPL-licensed, so we cannot distribute it. We try to avoid system library dependencies that we cannot bundle.

cau-git avatar Feb 25 '25 13:02 cau-git
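One dependency-free direction, given the constraint above, is to inspect the archive's member names rather than rely on magic bytes. The following is a minimal sketch using only the standard library. The helper names looks_like_xlsx and _zip_with are hypothetical and not part of docling. Treating the presence of [Content_Types].xml plus an xl/ entry as "XLSX" is a heuristic assumed here, not a full OOXML validation.

```python
import io
import zipfile

# Standard MIME type for XLSX workbooks.
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def looks_like_xlsx(data: bytes) -> bool:
    """Heuristic (hypothetical helper): ZIP with OOXML spreadsheet entries."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    has_content_types = "[Content_Types].xml" in names
    has_xl_part = any(name.startswith("xl/") for name in names)
    return has_content_types and has_xl_part


def _zip_with(names: list[str]) -> bytes:
    """Build an in-memory ZIP containing empty entries with the given names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    return buf.getvalue()


xlsx_like = _zip_with(["[Content_Types].xml", "xl/workbook.xml"])
plain_zip = _zip_with(["notes.txt"])

# Map the heuristic result onto a MIME type, falling back to generic ZIP.
mime = XLSX_MIME if looks_like_xlsx(xlsx_like) else "application/zip"
print(mime)
```

A check like this could serve as a fallback when the first-pass guess is application/zip. Whether it suits docling's own detection path is a question for the maintainers.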

Hi @MatheusAbdias, I also got the same error with one xlsx file, although all of the other xlsx files that I tried were parsed normally.

.venv/lib/python3.12/site-packages/docling/document_converter.py", line 328, in _process_document
   raise ConversionError(error_message)
docling.exceptions.ConversionError: File format not allowed: 7829e919_bfb7_4e40_a641_78f303f2638f.xlsx

ta0a2000t avatar May 08 '25 23:05 ta0a2000t