python-docx2txt
docx2txt - unwrapping zip - crashes with KeyError when 'word/document.xml' is missing
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 284, in load
    docs += self.process_pages(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 439, in process_pages
    doc = self.process_page(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 479, in process_page
    attachment_texts = self.process_attachment(page["id"], ocr_languages)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 557, in process_attachment
    text = title + self.process_doc(absolute_url)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 657, in process_doc
    return docx2txt.process(file_data)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/docx2txt/docx2txt.py", line 88, in process
    text += xml2text(zipf.read(doc_xml))
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1464, in read
    with self.open(name, "r", pwd) as fp:
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1503, in open
    zinfo = self.getinfo(name)
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1430, in getinfo
    raise KeyError(
KeyError: "There is no item named 'word/document.xml' in the archive"
The read in docx2txt.process should probably be guarded like this:
if doc_xml not in zipf.namelist():
    # Handle the missing file - skip it/soft error/etc
    pass
else:
    text += xml2text(zipf.read(doc_xml))
...
This fixes it:
...
    # get main text
    doc_xml = 'word/document.xml'
    if doc_xml not in zipf.namelist():
        print(f"{doc_xml} not in ZIP Namelist")
    else:
        text += xml2text(zipf.read(doc_xml))
...
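For anyone who wants to try the guard outside of docx2txt, here is a rough standalone sketch of the same idea using only the standard library. It is not docx2txt's actual code: `extract_main_text` and `make_zip` are hypothetical names made up for this example, the paragraph handling is far simpler than what `xml2text` does, and returning an empty string for a missing `word/document.xml` is just one possible soft-failure choice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DOC_XML = "word/document.xml"


def extract_main_text(file_obj):
    """Return the main document text, or "" if word/document.xml is absent.

    Hypothetical helper for illustration; not part of docx2txt.
    """
    with zipfile.ZipFile(file_obj) as zipf:
        if DOC_XML not in zipf.namelist():
            # Soft failure instead of the KeyError raised by zipf.read()
            return ""
        root = ET.fromstring(zipf.read(DOC_XML))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs that belong to one paragraph
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_zip(members):
    """Hypothetical test helper: build an in-memory ZIP from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zipf:
        for name, data in members.items():
            zipf.writestr(name, data)
    buf.seek(0)
    return buf
```

Calling `extract_main_text(make_zip({"other.txt": b"x"}))` takes the guarded branch and returns an empty string rather than raising. Whether to skip silently, log, or raise a clearer exception is a design choice for the caller, for example the Confluence loader shown in the traceback.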