docling.exceptions.ConversionError: File format not allowed

Open avlm opened this issue 8 months ago • 1 comments

Bug

I receive an exception when trying to parse an HTML page. I found the related issue #1048, but in my case the request returns text/html; charset=UTF-8 as the content type, as expected, and the exception still occurs.

Steps to reproduce

from docling.document_converter import DocumentConverter

source_url = "https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
converter = DocumentConverter()
result = converter.convert(source_url, headers=headers)
result.document.export_to_markdown()

Docling version

Docling version: 2.30.0
Docling Core version: 2.28.0
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-312 (3.12.7)
Platform: Linux-6.8.0-76060800daily20240311-generic-x86_64-with-glibc2.35

Python version

3.12.7

Traceback:

Input document hifiman-he1000-unveiled-planar-magnetic-headphone-review does not match any allowed format.
Traceback (most recent call last):
  File "/home/antonio/Work/AD/magazine-scrapers/docling_test.py", line 14, in <module>
    result = converter.convert(source_url, headers=headers)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 220, in convert
    return next(all_res)
           ^^^^^^^^^^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 243, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 278, in _convert
    for item in map(
                ^^^^
  File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 328, in _process_document
    raise ConversionError(error_message)
docling.exceptions.ConversionError: File format not allowed: hifiman-he1000-unveiled-planar-magnetic-headphone-review

May 06 '25 15:05 avlm

Thanks @avlm for reporting this issue. I can confirm that we can reproduce the error.

In Docling we have a set of tools to detect the format of a file and apply different backend parsers accordingly. Some of these tools are heuristic and may fail, especially if the document originates from an external URL. Storing the content in a file with the appropriate extension makes it more likely that the format will be correctly identified and the document converted. For instance, the following code would successfully run the conversion of the example above:

import requests
from docling.document_converter import DocumentConverter

source_url = "https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

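# Download the page, then store it locally with an .html extension
res = requests.get(source_url, headers=headers)
res.raise_for_status()

with open("example.html", "w", encoding="utf-8") as f:
    f.write(res.text)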

converter = DocumentConverter()
result = converter.convert("example.html")
result.document.export_to_markdown()
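
If the workaround has to cover more than one URL, the local file name can be derived from the URL and the Content-Type header instead of being hard-coded. The following is only a sketch: `filename_for_response` and the `EXTENSIONS` mapping are hypothetical helpers written for illustration, not part of Docling, and the mapping covers just a few formats.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative mapping only (an assumption, not Docling's own format table).
EXTENSIONS = {
    "text/html": ".html",
    "application/pdf": ".pdf",
    "text/markdown": ".md",
}


def filename_for_response(url: str, content_type: str, default_ext: str = ".html") -> str:
    """Build a local file name whose extension reflects the Content-Type header."""
    # Drop parameters such as "; charset=UTF-8" from the header value.
    mime = content_type.split(";", 1)[0].strip().lower()
    ext = EXTENSIONS.get(mime, default_ext)
    # Use the last non-empty path segment of the URL as the file stem.
    stem = PurePosixPath(urlparse(url).path.rstrip("/")).name or "document"
    # Do not append the extension twice if the URL already ends with it.
    return stem if stem.lower().endswith(ext) else stem + ext
```

With the snippet above, the name would come from `filename_for_response(source_url, res.headers.get("Content-Type", ""))` in place of the literal "example.html".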

In any case, Docling should also be able to run the conversion by just passing the URL, like in the steps you described. This issue will be fixed in the next PR.

May 28 '25 17:05 ceberam