KeyError and RuntimeError when opening a DOCX document
Bug
When using docling to parse a DOCX file, a KeyError is raised, followed by a RuntimeError.
Steps to reproduce
Parsing a DOCX file with the llama_index.readers.docling package raises the errors above (KeyError and RuntimeError):
An unexpected error occurred while opening the document 322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx
Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 80, in __init__
    self.docx_obj = Document(str(self.path_or_stream))
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1486, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1525, in open
    zinfo = self.getinfo(name)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/zipfile.py", line 1452, in getinfo
    raise KeyError(
KeyError: "There is no item named 'NULL' in the archive"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 136, in __init__
    self._init_doc(backend, path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/datamodel/document.py", line 185, in _init_doc
    self._backend = backend(self, path_or_stream=path_or_stream)
  File "/home/hao/miniconda3/envs/10/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 84, in __init__
    raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash f6b61b74e024c498bc7b3611d4733e00f765bd3d553cd3275952b533596de09b
Failed to load file /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx with error: Input document /home/hao/python_projects/xmbank/compa/../demoData/人行资料/322--中国人民银行关于发布存款保险费率管理和保费核定办法(试行)的通知(银发2016177号).docx is not valid.. Skipping...
Docling version
docling-2.34.0
Python version
Python 3.10.16
@chaos798 Can you attach the sample document?
Hi @PeterStaar-IBM, I am hitting what looks like the same issue. I cannot attach the document for confidentiality reasons, but here is the output:
docling "file.docx"
ERROR:docling.datamodel.document:An unexpected error occurred while opening the document file.docx
Traceback (most recent call last):
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\backend\msword_backend.py", line 84, in __init__
self.docx_obj = Document(str(self.path_or_stream))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\api.py", line 27, in Document
document_part = cast("DocumentPart", Package.open(docx).main_document_part)
^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\package.py", line 127, in open
pkg_reader = PackageReader.from_file(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 25, in from_file
sparts = PackageReader._load_serialized_parts(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 53, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 86, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\pkgreader.py", line 81, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docx\opc\phys_pkg.py", line 83, in blob_for
return self._zipf.read(pack_uri.membername)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1567, in read
with self.open(name, "r", pwd) as fp:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1604, in open
zinfo = self.getinfo(name)
^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\AppData\Local\Programs\Python\Python312\Lib\zipfile\__init__.py", line 1532, in getinfo
raise KeyError(
KeyError: "There is no item named 'customXML/item5.xml' in the archive"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\datamodel\document.py", line 137, in __init__
self._init_doc(backend, path_or_stream)
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\datamodel\document.py", line 186, in _init_doc
self._backend = backend(self, path_or_stream=path_or_stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\jwebster\pipx\.cache\b595bda9ec033c2\Lib\site-packages\docling\backend\msword_backend.py", line 88, in __init__
raise RuntimeError(
RuntimeError: MsPowerpointDocumentBackend could not load document with hash [REDACTED]
WARNING:docling.cli.main:Document file.docx failed to convert.
The key line is KeyError: "There is no item named 'customXML/item5.xml' in the archive"
When I extract the docx as an archive, it contains a number of files under customXML.
item5.xml contains metadata that a document management system has embedded. Here are its contents, with redactions:
<?xml version="1.0" encoding="UTF-8"?><properties xmlns="http://www.imanage.com/work/xmlschema"><documentid>[REDACTED COMPANY INTERNAL DOCUMENT ID]</documentid><senderid>[REDACTED 6 DIGIT NUMBER]</senderid><senderemail>[REDACTED EMAIL]</senderemail><lastmodified>[REDACTED ISO 8601 DATE]</lastmodified><database>[REDACTED TEXT]</database></properties>
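To see exactly which relationship targets the archive is missing, a small stdlib-only check along the following lines may help. This is a sketch, not part of docling or python-docx: the function name find_missing_targets is hypothetical, and it assumes the usual OPC layout in which each .rels file sits in a _rels folder next to the part it describes. It also reports any archive member whose name differs from the target only by letter case, which may or may not be what is happening here.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by OPC relationship (.rels) files.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_missing_targets(docx_path):
    """Return (rels_file, target, case_insensitive_match) tuples for every
    internal relationship whose target is not a member of the archive.

    Hypothetical helper for diagnosis only.
    """
    missing = []
    with zipfile.ZipFile(docx_path) as zf:
        names = set(zf.namelist())
        by_lower = {name.lower(): name for name in names}
        for rels_name in sorted(n for n in names if n.endswith(".rels")):
            # "word/_rels/document.xml.rels" resolves targets against "word".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # hyperlinks etc. are not archive members
                target = rel.get("Target", "")
                if target.startswith("#"):
                    continue  # internal bookmark, not a part
                if target.startswith("/"):
                    member = target.lstrip("/")
                else:
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    missing.append((rels_name, target, by_lower.get(member.lower())))
    return missing


if __name__ == "__main__":
    import sys

    for rels_file, target, near_match in find_missing_targets(sys.argv[1]):
        print(f"{rels_file}: {target!r} not found (case-insensitive match: {near_match})")
```

Running it against the failing file prints one line per unresolved target, which makes it easier to decide which prefixes a workaround needs to skip.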
Using pipx 1.7.1, Docling 2.36.0
Hope this helps!
Hi @chaos798 @PeterStaar-IBM @fn5 @pyaaaa
This comes from python-docx. For some reason the upstream fix is taking a while to land; see how we patch it here: https://github.com/python-openxml/python-docx/issues/1351#issuecomment-2674758280
@fn5, looking at your bug, the PR I created will not fully fix your case, but you can still monkey-patch the "customXML" case on your side. I have asked on python-docx whether it is safe to handle customXML as well. For a quick fix, you can do this in your module (adapt "customXML/item5" carefully; you may need to replace it with "customXML" if other customXML elements also cause problems):
from docx.opc.pkgreader import _SerializedRelationships, _SerializedRelationship
from docx.opc.oxml import parse_xml


@staticmethod
def load_from_xml_v2(baseURI, rels_item_xml):
    """Return |_SerializedRelationships| instance loaded with the relationships
    contained in `rels_item_xml`.

    Returns an empty collection if `rels_item_xml` is |None|.
    """
    srels = _SerializedRelationships()
    if rels_item_xml is not None:
        rels_elm = parse_xml(rels_item_xml)
        for rel_elm in rels_elm.Relationship_lst:
            # Null target
            if rel_elm.target_ref in ("../NULL", "NULL"):
                continue
            # Internal bookmarks
            if rel_elm.target_ref.startswith("#_") or rel_elm.target_ref.startswith(
                "#"
            ):
                continue
            # CustomXML item5 missing
            if rel_elm.target_ref.startswith("customXML/item5"):
                continue
            srels._srels.append(_SerializedRelationship(baseURI, rel_elm))
    return srels


_SerializedRelationships.load_from_xml = load_from_xml_v2
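If the list of prefixes to skip needs adjusting per document, the three skip conditions from the patch above could be factored into a standalone predicate, as in the sketch below. The names should_skip_target and DEFAULT_SKIP_PREFIXES are hypothetical and not part of python-docx; the function only mirrors the string checks shown above and does not touch the library.

```python
# Hypothetical helper mirroring the skip conditions of the patch above.
# Neither name exists in python-docx; both are illustrative only.
DEFAULT_SKIP_PREFIXES = ("customXML/item5",)


def should_skip_target(target_ref, skip_prefixes=DEFAULT_SKIP_PREFIXES):
    """Return True if a relationship with this target should be ignored."""
    # Null target
    if target_ref in ("../NULL", "NULL"):
        return True
    # Internal bookmarks ("#" also covers the "#_" case)
    if target_ref.startswith("#"):
        return True
    # Caller-supplied prefixes, e.g. ("customXML",) to skip every customXML part
    return target_ref.startswith(tuple(skip_prefixes))
```

Inside the patched loader, the three `if ... continue` checks would then collapse to a single `if should_skip_target(rel_elm.target_ref): continue`, and widening the filter to all customXML parts becomes a one-argument change.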