Loading Failure of Files with Chinese Filenames

Open 943fansi opened this issue 8 months ago • 9 comments

Bug

Loading a file whose filename contains Chinese characters fails with an error. When the same file is renamed to "111", it loads normally.

Steps to reproduce

from pathlib import Path

input_doc = Path("E:/PROJECT/CNKI/Papers/一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf")

Docling version

Version: 2.28.0

Python version

python 3.10.16

    self._init_doc(backend, path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling\datamodel\document.py", line 183, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling\backend\docling_parse_v4_backend.py", line 152, in __init__
    self.dp_doc: PdfDocument = self.parser.load(path_or_stream=self.path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling_parse\pdf_parser.py", line 477, in load
    raise RuntimeError(f"Failed to load document with key {key}")
RuntimeError: Failed to load document with key key=E:\PROJECT\CNKI\Papers\一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf
2025-03-21 22:46:12,767 - INFO - Going to convert document batch...
2025-03-21 22:46:12,767 - ERROR - An unexpected error occurred: Input document E:\PROJECT\CNKI\Papers\一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf is not valid.

943fansi avatar Mar 21 '25 15:03 943fansi

Here is the link to the file 一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf: https://github.com/943fansi/huancun/blob/main/%E4%B8%80%E7%A7%8D%E9%80%82%E7%94%A8%E4%BA%8EMMC%E7%9A%84%E6%B7%B7%E5%90%88%E6%AD%A5%E9%95%BF%E7%94%B5%E7%A3%81%E6%9A%82%E6%80%81%E4%BB%BF%E7%9C%9F%E6%96%B9%E6%B3%95_%E6%9E%97%E6%AF%85.pdf

943fansi avatar Mar 21 '25 15:03 943fansi

Additional note: I encountered the error when passing the file as a Path object, but it worked properly when I passed the file path as a string.

943fansi avatar Mar 21 '25 15:03 943fansi

The text in the red-framed area of the figure was not recognized. Why is that? [Image]

943fansi avatar Mar 21 '25 17:03 943fansi

Here is the link to the Markdown file produced by the Docling conversion: https://github.com/943fansi/huancun/blob/main/%E4%B8%80%E7%A7%8D%E9%80%82%E7%94%A8%E4%BA%8EMMC%E7%9A%84%E6%B7%B7%E5%90%88%E6%AD%A5%E9%95%BF%E7%94%B5%E7%A3%81%E6%9A%82%E6%80%81%E4%BB%BF%E7%9C%9F%E6%96%B9%E6%B3%95_%E6%9E%97%E6%AF%85.pdf.md

943fansi avatar Mar 21 '25 17:03 943fansi

I have the same problem. Is it solved now?

EddieRin avatar May 15 '25 03:05 EddieRin

We would kindly ask you to decouple the two issues reported here:

  1. Issue with a filename containing Chinese characters (on Windows)
  2. Some part of the document not detected
    • @943fansi I think this is not an issue. You are simply seeing the expected default behavior of Docling, which does not include page headers and footers in the natural flow of the exported text.

    • They can be included with the included_content_layers option of the export_to_markdown() method, for example:

      from docling_core.types.doc.document import ContentLayer
      doc.export_to_markdown(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE})
      

@EddieRin which of the two issues are you referring to?

dolfim-ibm avatar May 15 '25 06:05 dolfim-ibm

The first issue, about the Chinese filename.

EddieRin avatar May 16 '25 01:05 EddieRin

I also encountered this issue. When using a Chinese PDF file path, it works fine with a str type, but if I use a Path object, it throws an error saying the PDF is not valid.

hzkitty avatar May 16 '25 08:05 hzkitty
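
The str-versus-Path workaround reported in the comments above can be sketched as a small wrapper. This is only an illustration, not part of Docling: the helper name convert_with_str_path and its convert parameter are hypothetical, and it assumes the converter accepts a plain string path, as the reports above suggest.

```python
from pathlib import Path
from typing import Callable, TypeVar, Union

T = TypeVar("T")


def convert_with_str_path(convert: Callable[[str], T], source: Union[str, Path]) -> T:
    """Call `convert` with the source path as a plain str (hypothetical helper)."""
    # Normalise through Path first so both str and Path inputs are accepted,
    # then hand the converter a str, which the reports say avoids the failure.
    return convert(str(Path(source)))


# Possible use with Docling (assumes docling is installed):
#   from docling.document_converter import DocumentConverter
#   result = convert_with_str_path(DocumentConverter().convert, input_doc)
```

Passing the conversion function in as an argument keeps the wrapper independent of any particular Docling version. It only papers over the problem on the caller's side; the underlying fix was made in docling-parse, as noted later in the thread.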

We found that this issue is related to filenames with Unicode characters on Windows. The fix has to be made in the docling-parse library and validated with the internal dependencies.

Meanwhile, the pypdfium2 PDF backend does not seem to be affected. You can use it with:

docling --pdf-backend=pypdfium2 FILE

dolfim-ibm avatar May 26 '25 11:05 dolfim-ibm

How can I configure the PDF backend in the Python API?

EddieRin avatar Jun 05 '25 09:06 EddieRin

We have now fixed this in the docling-parse backend as well, so you don't have to change anything.

If you want to try different backends, here are a few examples https://docling-project.github.io/docling/examples/custom_convert/

dolfim-ibm avatar Jun 05 '25 09:06 dolfim-ibm
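
For readers who still want to select the pypdfium2 backend from Python, a minimal sketch along the lines of the linked custom_convert examples might look like the following. The function name build_pypdfium2_converter is hypothetical, and the import paths and the PdfFormatOption(backend=...) argument are assumptions based on those examples, so check them against your installed Docling version.

```python
def build_pypdfium2_converter():
    """Return a DocumentConverter that parses PDFs with the pypdfium2 backend."""
    # Imports live inside the function so this module can be loaded
    # even in an environment where docling is not installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Override only the PDF backend; all other options keep their defaults.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )


# Possible use (assumes docling is installed and input_doc is a PDF path):
#   result = build_pypdfium2_converter().convert(input_doc)
#   print(result.document.export_to_markdown())
```

This mirrors the CLI flag --pdf-backend=pypdfium2 mentioned earlier in the thread. With the docling-parse fix in place it should no longer be needed for Unicode filenames on Windows.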