Error converting a PDF when the file name contains Chinese characters

Open x1y9 opened this issue 9 months ago • 5 comments

Bug

Converting a PDF fails when the file name contains Chinese characters. If I rename the file to an English name, everything works.
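
Since renaming works, one possible stopgap is to copy the file to an ASCII-only name before handing it to docling. This is a minimal sketch, assuming the non-ASCII file name is what triggers the failure (not confirmed in this thread); `ascii_safe_copy` is a hypothetical helper, not part of docling:

```python
import shutil
import uuid
from pathlib import Path


def ascii_safe_copy(source: Path, workdir: Path) -> Path:
    """Copy `source` into `workdir` under an ASCII-only name (hypothetical helper).

    Returns the original path unchanged if its file name is already ASCII.
    """
    if source.name.isascii():
        return source
    # Keep the suffix only if it is ASCII, otherwise drop it.
    suffix = source.suffix if source.suffix.isascii() else ""
    # A random hex name avoids collisions and is guaranteed to be ASCII.
    target = workdir / f"{uuid.uuid4().hex}{suffix}"
    shutil.copyfile(source, target)
    return target
```

The returned path can then be passed to `docling` in place of the original. The working directory must itself have an ASCII-only path for this to help.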

Steps to reproduce

docling 中文.pdf -vv
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document 中文.pdf
Traceback (most recent call last):

Docling version

Docling version: 2.21.0
Docling Core version: 2.18.0
Docling IBM Models version: 3.3.1
Docling Parse version: 3.3.0
Python: cpython-311 (3.11.8)
Platform: Windows-10-10.0.22631-SP0

Python version

Python 3.11.8

x1y9 avatar Feb 11 '25 03:02 x1y9

@x1y9 Can you provide an example file so that we can reproduce this? I want to solve this ASAP.

PeterStaar-IBM avatar Feb 11 '25 05:02 PeterStaar-IBM

Rename any PDF file to "中文.pdf", then run docling to convert it.

x1y9 avatar Feb 11 '25 07:02 x1y9

We cannot reproduce the issue. Could it be something specific to Windows? Can you please share the stack trace produced?

dolfim-ibm avatar Feb 11 '25 07:02 dolfim-ibm

Check if your full path contains Chinese characters
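
A quick way to check this, as a sketch: `non_ascii_parts` is a hypothetical helper that lists every component of the resolved absolute path that contains non-ASCII characters.

```python
from pathlib import Path


def non_ascii_parts(path: str) -> list[str]:
    """Return the components of the absolute path that are not pure ASCII."""
    # resolve() makes the path absolute, so parent directories are checked too.
    return [part for part in Path(path).resolve().parts if not part.isascii()]


if __name__ == "__main__":
    # An empty list means the whole path is ASCII-only.
    print(non_ascii_parts("中文.pdf"))
```

If the list contains anything other than the file name itself, a parent directory (for example the user profile or the temp folder) is also non-ASCII.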

happyTonakai avatar Feb 12 '25 05:02 happyTonakai

@x1y9 If you can provide us with an example to reproduce, we can fix it. Otherwise, we will need to close this issue. We did try renaming the file, but saw no error.

PeterStaar-IBM avatar Feb 13 '25 06:02 PeterStaar-IBM

D:\temp>docling 中文.pdf -vv
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document 中文.pdf
Traceback (most recent call last):
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\datamodel\document.py", line 134, in __init__
    self._init_doc(backend, path_or_stream)
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\datamodel\document.py", line 183, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\backend\docling_parse_v2_backend.py", line 223, in __init__
    raise RuntimeError(
RuntimeError: docling-parse v2 could not load document 2dd9ee9b97fbfc7a4277c3911bdf2596c947a7f44f84fe9223a2c1c61dae2093.
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Finished converting document 中文.pdf in 0.83 sec.
WARNING:docling.cli.main:Document C:\Users\stars\AppData\Local\Temp\tmpodinf7ln\中文.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
Traceback (most recent call last):
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 632, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\stars\\AppData\\Local\\Temp\\tmpodinf7ln\\中文.pdf'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Scripts\docling.exe\__main__.py", line 7, in <module>
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 338, in __call__
    raise e
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\core.py", line 665, in main
    return _main(
           ^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\core.py", line 197, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\click\core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\typer\main.py", line 703, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\site-packages\docling\cli\main.py", line 322, in convert
    with tempfile.TemporaryDirectory() as tempdir:
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 943, in __exit__
    self.cleanup()
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 947, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 929, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 787, in rmtree
    return _rmtree_unsafe(path, onerror)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\shutil.py", line 634, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\stars\WPy64-31180\python-3.11.8.amd64\Lib\tempfile.py", line 893, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\stars\\AppData\\Local\\Temp\\tmpodinf7ln\\中文.pdf'

中文.pdf

x1y9 avatar Feb 27 '25 03:02 x1y9

@x1y9 Thank you so much for the stack trace and the PDF. However, this appears to be a permission error raised when Python tries to remove temporary files (which is a bit odd in itself, since docling should not create any files).

I ran docling with code enrichment and got the following result:

docling --from pdf --to html 中文.pdf --enrich-code

Full output: 中文.html.zip

My recommendation: please check the permission settings of your files and of your Python installation.

PeterStaar-IBM avatar Feb 27 '25 11:02 PeterStaar-IBM