
KeyError and RuntimeError occurred when opening a document(docx)

Open chaos798 opened this issue 7 months ago • 1 comments

Bug

Parsing a DOCX file with docling raises a KeyError followed by a RuntimeError.

Steps to reproduce

Parse a DOCX file with the llama_index.readers.docling package. The KeyError and RuntimeError below are raised.

An unexpected error occurred while opening the document 322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx
Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 80, in __init__
    self.docx_obj = Document(str(self.path_or_stream))
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1486, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1525, in open
    zinfo = self.getinfo(name)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1452, in getinfo
    raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 136, in __init__
    self._init_doc(backend, path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 185, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash f6b61b74e024c498bc7b3611d4733e00f765bd3d553cd3275952b533596de09b
Failed to load file /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx with error: Input document /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx is not valid.. Skipping...
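The innermost frame of the first traceback is the standard library's zipfile module, not docling or python-docx: a relationship in the package points at a member called NULL, and the ZIP archive has no such entry. A minimal sketch that reproduces only that last step, assuming a tiny in-memory archive as a stand-in for the real DOCX (the helper names and archive contents are hypothetical):

```python
import io
import zipfile


def build_demo_archive() -> bytes:
    """Build a tiny ZIP standing in for a DOCX package (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def read_member(archive_bytes: bytes, member: str) -> bytes:
    """Read one member from an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(member)


demo = build_demo_archive()
message = None
try:
    # Ask for a member that does not exist, as python-docx does for a NULL target.
    read_member(demo, "NULL")
except KeyError as exc:
    message = exc.args[0]

print(message)
```

Any relationship whose target cannot be found in the archive fails the same way, which is why the member name in the KeyError differs between reports in this thread.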

Docling version

docling-2.34.0

Python version

Python 3.10.16

chaos798 avatar May 23 '25 06:05 chaos798

@chaos798 Can you attach the sample document?

PeterStaar-IBM avatar May 26 '25 05:05 PeterStaar-IBM

Hi @PeterStaar-IBM, I am having a similar issue. I cannot attach the document for confidentiality reasons, but the error looks the same.

docling "file.docx"
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document file.docx
Traceback (most recent call last):
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\backend\msword_backend.py", line 84, in __init__
    self.docx_obj = Document(str(self.path_or_stream))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
                                         ^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 53, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
                                          ^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 86, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
                                          ^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 81, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1567, in read
    with self.open(name, "r", pwd) as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1604, in open
    zinfo = self.getinfo(name)
            ^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1532, in getinfo
    raise KeyError(
KeyError: "There is no item named 'customXML/item5.xml' in the archive"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\datamodel\document.py", line 137, in __init__
    self._init_doc(backend, path_or_stream)
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\datamodel\document.py", line 186, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\backend\msword_backend.py", line 88, in __init__
    raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash [REDACTED]
WARNING:docling.cli.main:Document file.docx failed to convert.

The key line is KeyError: "There is no item named 'customXML/item5.xml' in the archive"

When I extract the docx as an archive, the file contains a number of files under customXML.


item5.xml contains metadata embedded by a document management system. Here are its contents, with redactions:

<?xml version="1.0" encoding="UTF-8"?>
<properties xmlns="http://www.imanage.com/work/xmlschema">
  <documentid>[REDACTED COMPANY INTERNAL DOCUMENT ID]</documentid>
  <senderid>[REDACTED 6 DIGIT NUMBER]</senderid>
  <senderemail>[REDACTED EMAIL]</senderemail>
  <lastmodified>[REDACTED ISO 8601 DATE]</lastmodified>
  <database>[REDACTED TEXT]</database>
</properties>
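For anyone who cannot share their file, a stdlib-only check like the one below may help narrow things down: it walks every .rels part in the archive and lists internal relationship targets that have no matching ZIP member. This is a rough sketch, not part of docling or python-docx; the function name and the demo package are made up for illustration. It also reports a member whose name differs only by letter case, since ZIP lookups are case-sensitive. Whether case is actually the cause in the report above is an assumption that has not been verified.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def find_dangling_targets(docx_source):
    """List internal relationship targets that are missing from the archive.

    Returns tuples of (rels part, raw target, resolved member name,
    case-insensitive match or None). `docx_source` is a path or file object.
    """
    dangling = []
    with zipfile.ZipFile(docx_source) as zf:
        names = set(zf.namelist())
        by_lower = {name.lower(): name for name in names}
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" belongs to "word/document.xml",
            # so relative targets resolve against "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external links are not ZIP members
                target = rel.get("Target", "")
                if target.startswith("/"):
                    resolved = target.lstrip("/")
                else:
                    resolved = posixpath.normpath(posixpath.join(base_dir, target))
                if resolved not in names:
                    near_match = by_lower.get(resolved.lower())
                    dangling.append((rels_name, target, resolved, near_match))
    return dangling


def build_demo_package():
    """Hypothetical package: one valid target, one that differs only by case."""
    rels = (
        f'<Relationships xmlns="{RELS_NS}">'
        '<Relationship Id="rId1" Type="t" Target="word/document.xml"/>'
        '<Relationship Id="rId2" Type="t" Target="customXML/item5.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("customXml/item5.xml", "<properties/>")
    buf.seek(0)
    return buf


report = find_dangling_targets(build_demo_package())
for rels_name, target, resolved, near_match in report:
    print(rels_name, target, resolved, near_match)
```

To run it on a real file, pass the path instead of the demo package, e.g. find_dangling_targets("file.docx").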

Using pipx 1.7.1, Docling 2.36.0

Hope this helps!

fn5 avatar Jun 04 '25 05:06 fn5

I ran into the same problem. Many DOCX files fail with the same error the original poster reported, for example this file:

习近平在俄罗斯媒体发表署名文章.docx

pyaaaa avatar Jun 20 '25 06:06 pyaaaa

Hi @chaos798 @PeterStaar-IBM @fn5 @pyaaaa

This comes from python-docx. The upstream fix is taking a while to land; in the meantime, see how we patch it here: https://github.com/python-openxml/python-docx/issues/1351#issuecomment-2674758280

michaelromagne avatar Jun 21 '25 09:06 michaelromagne

@fn5 Looking at your bug, the PR I created will not fully fix your case, but you can still monkey-patch the "customXML" case on your side. I have asked the python-docx maintainers whether it is safe to handle customXML as well. For a quick fix, add the following to your module (adapt "customXML/item5" carefully; you may need to replace it with "customXML" if other customXML elements also cause problems):

from docx.opc.pkgreader import _SerializedRelationships, _SerializedRelationship
from docx.opc.oxml import parse_xml


@staticmethod
def load_from_xml_v2(baseURI, rels_item_xml):
    """Return |_SerializedRelationships| instance loaded with the relationships
    contained in `rels_item_xml`.

    Returns an empty collection if `rels_item_xml` is |None|.
    """
    srels = _SerializedRelationships()
    if rels_item_xml is not None:
        rels_elm = parse_xml(rels_item_xml)
        for rel_elm in rels_elm.Relationship_lst:
            # Null target
            if rel_elm.target_ref in ("../NULL", "NULL"):
                continue
            # Internal bookmarks ("#" also covers the "#_" prefix)
            if rel_elm.target_ref.startswith("#"):
                continue
            # CustomXML item5 missing
            if rel_elm.target_ref.startswith("customXML/item5"):
                continue
            srels._srels.append(_SerializedRelationship(baseURI, rel_elm))
    return srels


# Replace python-docx's loader with the tolerant version above.
_SerializedRelationships.load_from_xml = load_from_xml_v2
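If you need to widen or narrow the customXML condition, it may be easier to pull the skip rules into a plain function that can be tested without python-docx. The sketch below is one hypothetical way to do that: should_skip_target and its skip_prefixes parameter are illustrative names, not python-docx API. Stripping leading "../" segments before the prefix check is an added assumption that the patch above does not make.

```python
def should_skip_target(target_ref, skip_prefixes=()):
    """Return True when a relationship target should be ignored while loading.

    `skip_prefixes` holds extra target prefixes to drop, for example
    ("customXML/item5",) or the broader ("customXML",).
    """
    # Null target
    if target_ref in ("NULL", "../NULL"):
        return True
    # Internal bookmarks
    if target_ref.startswith("#"):
        return True
    # Assumption: treat "../customXML/..." like "customXML/...".
    normalized = target_ref
    while normalized.startswith("../"):
        normalized = normalized[3:]
    return any(normalized.startswith(prefix) for prefix in skip_prefixes)


# Narrow rule: only item5 is skipped, other customXML parts still load.
narrow = ("customXML/item5",)
print(should_skip_target("../customXML/item5.xml", narrow))
print(should_skip_target("customXML/item1.xml", narrow))

# Broad rule: every customXML relationship is skipped.
broad = ("customXML",)
print(should_skip_target("customXML/item1.xml", broad))
```

Inside load_from_xml_v2 the three if blocks would then collapse into a single call on rel_elm.target_ref. Note that the broad rule drops every customXML relationship, including ones whose parts do exist in the archive.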

michaelromagne avatar Jun 22 '25 08:06 michaelromagne