python-docx2txt

docx2txt - unwrapping zip - fails and crashes

Open: ventz opened this issue 1 year ago • 1 comment

File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 284, in load
    docs += self.process_pages(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 439, in process_pages
    doc = self.process_page(
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 479, in process_page
    attachment_texts = self.process_attachment(page["id"], ocr_languages)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 557, in process_attachment
    text = title + self.process_doc(absolute_url)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/langchain/document_loaders/confluence.py", line 657, in process_doc
    return docx2txt.process(file_data)
  File "/Users/user/project/.venv/lib/python3.10/site-packages/docx2txt/docx2txt.py", line 88, in process
    text += xml2text(zipf.read(doc_xml))
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1464, in read
    with self.open(name, "r", pwd) as fp:
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1503, in open
    zinfo = self.getinfo(name)
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/zipfile.py", line 1430, in getinfo
    raise KeyError(
KeyError: "There is no item named 'word/document.xml' in the archive"
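The same KeyError can be reproduced with the standard library alone. This is a minimal sketch, assuming the failing attachment was a valid ZIP archive that simply had no word/document.xml entry (the traceback does not show what the file actually contained); the placeholder.txt entry is made up for illustration.

```python
import io
import zipfile

# Build an in-memory ZIP that is valid but has no 'word/document.xml' entry.
# 'placeholder.txt' is a made-up entry, used only to make the archive non-empty.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("placeholder.txt", "not a Word document")
buf.seek(0)

doc_xml = "word/document.xml"
with zipfile.ZipFile(buf) as zipf:
    try:
        # Same kind of call as the failing line in docx2txt.process
        zipf.read(doc_xml)
        error_message = None
    except KeyError as exc:
        error_message = str(exc)

print(error_message)
```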

The read of word/document.xml should probably be guarded, along these lines:

if doc_xml not in zipf.namelist():
    # Handle the missing file - skip it/soft error/etc
    pass
else:
    text += xml2text(zipf.read(doc_xml))
...

ventz, Oct 27 '23 02:10

This fixes it:

...
    # get main text
    doc_xml = 'word/document.xml'
    if doc_xml not in zipf.namelist():
        print(f"{doc_xml} not in ZIP Namelist")
    else:
        text += xml2text(zipf.read(doc_xml))
...
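For callers who cannot patch the installed package, the same check can be done before handing the file to docx2txt. This is a rough sketch using only the standard library; has_main_document is a hypothetical helper name, not part of docx2txt or langchain, and it assumes file_data is either a path or a seekable file-like object.

```python
import io
import zipfile

MAIN_DOCUMENT = "word/document.xml"


def has_main_document(file_data):
    """Return True if file_data is a ZIP archive containing word/document.xml.

    Hypothetical helper: file_data may be a path or a seekable file-like object.
    """
    try:
        with zipfile.ZipFile(file_data) as zipf:
            return MAIN_DOCUMENT in zipf.namelist()
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, so it cannot be a .docx file
        return False
    finally:
        # Rewind file-like objects so a later reader starts from the beginning
        if hasattr(file_data, "seek"):
            file_data.seek(0)


# Small in-memory examples: one archive with the main document, one without
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr(MAIN_DOCUMENT, "<document/>")

bad = io.BytesIO()
with zipfile.ZipFile(bad, "w") as zf:
    zf.writestr("placeholder.txt", "no main document here")

print(has_main_document(good), has_main_document(bad))
```

A caller could then run docx2txt.process(file_data) only when has_main_document(file_data) is true, and skip or log the attachment otherwise.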

ventz, Oct 27 '23 02:10