docling
Error when converting a PDF whose file name contains Chinese characters
Bug
Converting a PDF fails when the file name contains Chinese characters. If I rename the file to an English name, everything works.
Steps to reproduce
docling 中文.pdf -vv
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document 中文.pdf
Traceback (most recent call last):
Docling version
Docling version: 2.21.0
Docling Core version: 2.18.0
Docling IBM Models version: 3.3.1
Docling Parse version: 3.3.0
Python: cpython-311 (3.11.8)
Platform: Windows-10-10.0.22631-SP0
Python version
Python 3.11.8
@x1y9 Can you provide an example file so we can reproduce this? I want to solve it as soon as possible.
Rename any PDF file to "中文.pdf", then run docling to convert it.
We cannot reproduce the issue. Could it be something specific to Windows? Can you please share the stack trace that is produced?
Check if your full path contains Chinese characters
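To make that check concrete, here is a minimal sketch that lists which components of a Windows path contain non-ASCII characters. The helper name `non_ascii_parts` is hypothetical and not part of docling; it only inspects the path string and does not touch the file system.

```python
from pathlib import PureWindowsPath


def non_ascii_parts(path: str) -> list[str]:
    """Return the components of a Windows-style path that contain non-ASCII characters."""
    # PureWindowsPath parses backslash-separated paths on any OS.
    return [part for part in PureWindowsPath(path).parts if not part.isascii()]


# Path taken from the log in this thread; only the file name is non-ASCII.
suspects = non_ascii_parts(r"C:\Users\stars\AppData\Local\Temp\tmpodinf7ln\中文.pdf")
```

An empty result means every directory and the file name are plain ASCII, which would rule out the path as the trigger.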
@x1y9 If you can provide an example that reproduces the problem, we can fix it. Otherwise, we will need to close this issue. We did try renaming a file, but saw no error.
D:\temp>docling 中文.pdf -vv
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document 中文.pdf
Traceback (most recent call last):
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\datamodel\document.py", line 134, in __init__
self._init_doc(backend, path_or_stream)
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\datamodel\document.py", line 183, in _init_doc
self._backend = backend(self, path_or_stream=path_or_stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 223, in __init__
raise RuntimeError(
RuntimeError: docling-parse v2 could not load document 2dd9ee9b97fbfc7a4277c3911bdf2596c947a7f44f84fe9223a2c1c61dae2093.
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Finished converting document 中文.pdf in 0.83 sec.
WARNING:docling.cli.main:Document C:\Users\stars\AppData\Local\Temp\tmpodinf7ln\中文.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
Traceback (most recent call last):
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 632, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\stars\\AppData\\Local\\Temp\\tmpodinf7ln\\中文.pdf'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Scripts\docling.exe\__main__.py", line 7, in <module>
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 338, in __call__
raise e
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 1161, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\core.py", line 665, in main
return _main(
^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\core.py", line 197, in _main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 788, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 703, in wrapper
return callback(**use_params)
^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\cli\main.py", line 322, in convert
with tempfile.TemporaryDirectory() as tempdir:
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 943, in __exit__
self.cleanup()
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 947, in cleanup
self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 929, in _rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 787, in rmtree
return _rmtree_unsafe(path, onerror)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 634, in _rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 893, in onerror
_os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\stars\\AppData\\Local\\Temp\\tmpodinf7ln\\中文.pdf'
@x1y9 Thank you so much for the stack trace and the PDF. However, this looks like a permission error raised when Python tries to remove temporary files (which is itself a bit odd, since docling should not create any files).
I ran docling with code enrichment and it produced the following result:
docling --from pdf --to html 中文.pdf --enrich-code
full output: 中文.html.zip
My recommendation: please check the permission settings of your files and your Python installation.
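Since the reporter found that an English file name works, one possible stopgap is to convert an ASCII-named copy of the file. The sketch below is only an illustration of that idea: the helper `ascii_safe_copy` is hypothetical, not a docling API, and it does not address the case where a parent directory itself contains non-ASCII characters.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def ascii_safe_copy(src: Path, dst_dir: Path) -> Path:
    """Copy src into dst_dir under an ASCII-only file name and return the new path."""
    if src.name.isascii():
        safe_name = src.name
    else:
        # Derive a stable ASCII name from the original one; keep the
        # extension only if it is ASCII itself.
        digest = hashlib.sha256(src.name.encode("utf-8")).hexdigest()[:16]
        suffix = src.suffix if src.suffix.isascii() else ""
        safe_name = f"{digest}{suffix}"
    dst = dst_dir / safe_name
    shutil.copyfile(src, dst)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        original = Path(workdir) / "中文.pdf"
        original.write_bytes(b"%PDF-1.4")
        copy = ascii_safe_copy(original, Path(workdir))
        print(copy.name)  # an ASCII-only name ending in .pdf
```

The returned path can then be passed to `docling` in place of the original file.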