docling.exceptions.ConversionError: File format not allowed
Bug
I receive an exception when trying to parse an HTML page. I found the related issue #1048, but in my case the request returns text/html; charset=UTF-8 as the content type, as expected, and the exception still occurs.
Steps to reproduce
from docling.document_converter import DocumentConverter

source_url = "https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
converter = DocumentConverter()
result = converter.convert(source_url, headers=headers)
result.document.export_to_markdown()
Docling version
Docling version: 2.30.0
Docling Core version: 2.28.0
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-312 (3.12.7)
Platform: Linux-6.8.0-76060800daily20240311-generic-x86_64-with-glibc2.35
Python version
3.12.7
Traceback:
Input document hifiman-he1000-unveiled-planar-magnetic-headphone-review does not match any allowed format.
Traceback (most recent call last):
File "/home/antonio/Work/AD/magazine-scrapers/docling_test.py", line 14, in <module>
result = converter.convert(source_url, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
return wrapper(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 220, in convert
return next(all_res)
^^^^^^^^^^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 243, in convert_all
for conv_res in conv_res_iter:
^^^^^^^^^^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 278, in _convert
for item in map(
^^^^
File "/home/antonio/.pyenv/versions/magazine-scrapers/lib/python3.12/site-packages/docling/document_converter.py", line 328, in _process_document
raise ConversionError(error_message)
docling.exceptions.ConversionError: File format not allowed: hifiman-he1000-unveiled-planar-magnetic-headphone-review
Thanks @avlm for reporting this issue. I can confirm that we can reproduce the error.
In Docling we have a set of tools to detect the format of a file and apply the matching backend parser. Some of these tools are heuristic and may fail, especially if the document originates from an external URL. Storing the content in a file with the appropriate extension improves the chances that the format is identified correctly and the document is converted. For instance, the following code successfully runs the conversion of the example above:
import requests
from docling.document_converter import DocumentConverter

source_url = "https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

# Download the page ourselves and fail early on HTTP errors
res = requests.get(source_url, headers=headers)
res.raise_for_status()

# Save it with an .html extension so the format can be detected from the file name
with open("example.html", "w", encoding="utf-8") as f:
    f.write(res.text)

converter = DocumentConverter()
result = converter.convert("example.html")
result.document.export_to_markdown()
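To illustrate why the file extension matters here, below is a minimal sketch of extension-based detection using only the Python standard library. It is not Docling's actual detection code; `guess_from_name` is a hypothetical helper written for this comment, and it only shows how little a name-based heuristic has to go on when the last URL segment has no extension.

```python
from __future__ import annotations

import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse


def guess_from_name(source: str) -> tuple[str, str | None]:
    """Return the last path segment of a URL or path, plus the MIME type
    guessed from its extension alone (hypothetical helper, not Docling API)."""
    # Drop the trailing slash so the final segment is not empty
    path = urlparse(source).path.rstrip("/")
    name = PurePosixPath(path).name
    # mimetypes only looks at the extension; it never reads the content
    mime, _encoding = mimetypes.guess_type(name)
    return name, mime


url = "https://hometheaterhifi.com/reviews/headphone-earphone/hifiman-he1000-unveiled-planar-magnetic-headphone-review/"
print(guess_from_name(url))
print(guess_from_name("example.html"))
```

The URL from the report yields the name hifiman-he1000-unveiled-planar-magnetic-headphone-review with no MIME type, which matches the name shown in the error message, while example.html resolves to text/html.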
In any case, Docling should also be able to run the conversion by just passing the URL, like in the steps you described. This issue will be fixed in the next PR.