docling
Loading Failure of Files with Chinese Filenames
Bug
Loading a file whose filename contains Chinese characters fails with an error. When the same file is renamed to "111", it loads normally.
Steps to reproduce
input_doc = Path("E:/PROJECT/CNKI/Papers/一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf")
Docling version
Version: 2.28.0
Python version
python 3.10.16
    self._init_doc(backend, path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling\datamodel\document.py", line 183, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling\backend\docling_parse_v4_backend.py", line 152, in __init__
    self.dp_doc: PdfDocument = self.parser.load(path_or_stream=self.path_or_stream)
  File "E:\Programs\miniconda\envs\doc\lib\site-packages\docling_parse\pdf_parser.py", line 477, in load
    raise RuntimeError(f"Failed to load document with key {key}")
RuntimeError: Failed to load document with key key=E:\PROJECT\CNKI\Papers\一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf
2025-03-21 22:46:12,767 - INFO - Going to convert document batch...
2025-03-21 22:46:12,767 - ERROR - An unexpected error occurred: Input document E:\PROJECT\CNKI\Papers\一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf is not valid.
Here is a link to the file 一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf: https://github.com/943fansi/huancun/blob/main/%E4%B8%80%E7%A7%8D%E9%80%82%E7%94%A8%E4%BA%8EMMC%E7%9A%84%E6%B7%B7%E5%90%88%E6%AD%A5%E9%95%BF%E7%94%B5%E7%A3%81%E6%9A%82%E6%80%81%E4%BB%BF%E7%9C%9F%E6%96%B9%E6%B3%95_%E6%9E%97%E6%AF%85.pdf
Additional note: the error occurs when I pass the file location as a Path object, but it works properly when I pass it as a string.
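A minimal sketch of the string workaround described in the note above, for affected versions. The helper names `as_source_string` and `convert_with_string_path` are hypothetical and only for illustration; the sketch assumes that `DocumentConverter.convert` accepts a plain string source, as reported in this thread.

```python
from pathlib import Path


def as_source_string(path: Path) -> str:
    """Return the path as a plain str, the form reported to work in this thread."""
    return str(path)


def convert_with_string_path(path: Path):
    """Convert a document, passing its location as a str instead of a Path.

    Hypothetical helper: docling is imported lazily so that the module can be
    imported even where docling is not installed.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return converter.convert(as_source_string(path))


# Example call (requires docling and the PDF from the report):
# input_doc = Path("E:/PROJECT/CNKI/Papers/一种适用于MMC的混合步长电磁暂态仿真方法_林毅.pdf")
# result = convert_with_string_path(input_doc)
# print(result.document.export_to_markdown())
```

This only sidesteps the problem on the caller's side; it does not change how the PDF backend handles the filename.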
The text in the red-framed area of the figure was not recognized. Why is that?
Here is the Markdown file produced by Docling from this PDF: https://github.com/943fansi/huancun/blob/main/%E4%B8%80%E7%A7%8D%E9%80%82%E7%94%A8%E4%BA%8EMMC%E7%9A%84%E6%B7%B7%E5%90%88%E6%AD%A5%E9%95%BF%E7%94%B5%E7%A3%81%E6%9A%82%E6%80%81%E4%BB%BF%E7%9C%9F%E6%96%B9%E6%B3%95_%E6%9E%97%E6%AF%85.pdf.md
I have the same problem. Is it solved now?
We would kindly ask you to decouple the two issues reported here:
- Issue with a filename containing Chinese characters (on Windows)
- Some part of the document not detected
@943fansi I think this is not an issue. You are simply seeing the expected default behavior of Docling, which does not include page headers and footers in the natural flow of the exported text.
This can be enabled with the included_content_layers option of the export_to_markdown() method, for example:
    from docling_core.types.doc.document import ContentLayer
    doc.export_to_markdown(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE})
@EddieRin which of the two issues are you referring to?
The first issue, about the Chinese filename.
I also encountered this issue. When using a Chinese PDF file path, it works fine with a str type, but if I use a Path object, it throws an error saying the PDF is not valid.
We found that this issue is related to filenames with Unicode characters on Windows. The fix has to be made in the docling-parse library and validated with the internal dependencies.
Meanwhile, it seems that using the pypdfium2 PDF backend is not affected. You can use it with
docling --pdf-backend=pypdfium2 FILE
How can I configure the PDF backend in the Python API?
Well, we have now fixed this in the docling-parse backend as well, so you don't have to change anything.
If you want to try different backends, here are a few examples https://docling-project.github.io/docling/examples/custom_convert/
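For the Python API question above, a minimal sketch of selecting the pypdfium2 backend, modeled on the linked custom_convert examples. The function name `build_pypdfium_converter` is hypothetical, and import paths may differ between Docling versions, so treat this as an assumption to check against your installed release.

```python
def build_pypdfium_converter():
    """Build a DocumentConverter that parses PDFs with the pypdfium2 backend.

    Hypothetical helper: docling is imported lazily inside the function so the
    module can be imported even where docling is not installed.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Default pipeline options; only the PDF backend is swapped.
    pipeline_options = PdfPipelineOptions()
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )


# Example call (requires docling; FILE stands for the path to your PDF):
# converter = build_pypdfium_converter()
# result = converter.convert(FILE)
# print(result.document.export_to_markdown())
```

This mirrors the `--pdf-backend=pypdfium2` command-line option mentioned earlier in the thread.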