
ingest.py on eml throws zipfile.BadZipFile: File is not a zip file

Open slavag opened this issue 1 year ago • 19 comments


Describe the bug and how to reproduce it
Running ingest.py on a source_documents folder containing many .eml files throws zipfile.BadZipFile: File is not a zip file.

Expected behavior
Expecting the .eml files to be loaded.

Environment (please complete the following information):

  • macOS 12.6 / M1
  • Python version 3.11.3

Additional context

Loading new documents:   0%| | 0/75093 [00:08<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 49, in load
    doc = UnstructuredEmailLoader.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/email.py", line 22, in _get_elements
    from unstructured.partition.email import partition_email
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/email.py", line 41, in <module>
    from unstructured.partition.html import partition_html
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 6, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 25, in <module>
    from unstructured.partition.text_type import (
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1301, in __init__
    self._RealGetContents()
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1368, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 59, in load
    raise type(e)(f"{self.file_path}: {e}") from e
zipfile.BadZipFile: source_documents/2013-01-03 095102 dea8d7fd13.eml: File is not a zip file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
zipfile.BadZipFile: source_documents/2013-01-03 095102 dea8d7fd13.eml: File is not a zip file
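Note that the BadZipFile above comes from NLTK failing to open one of its own data archives under ~/nltk_data, not from the .eml file itself. A quick diagnostic sketch to locate the corrupt archive (find_corrupt_zips is a hypothetical helper, and NLTK's default ~/nltk_data download location is assumed):

```python
import zipfile
from pathlib import Path

def find_corrupt_zips(data_dir):
    """Return any .zip archives under data_dir that are not valid zip files."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []
    return [p for p in data_dir.rglob("*.zip") if not zipfile.is_zipfile(p)]

# Typical usage against NLTK's default download location:
for bad in find_corrupt_zips(Path.home() / "nltk_data"):
    print(f"corrupt archive: {bad}")
```

Any archive this prints (e.g. an interrupted download of averaged_perceptron_tagger.zip) can be deleted and re-downloaded.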

slavag avatar May 21 '23 15:05 slavag

Same here. I only have txt, html and pdf files. And I get that zip error.

AntouanK avatar May 21 '23 21:05 AntouanK

It's happening at import time. This is the cause: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/nlp/tokenize.py#LL31C39-L31C52 When I try to unzip this myself it fails. I found in the nltk_data repo that you can do

>>> import nltk
>>> nltk.download()

which led me to https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. I downloaded averaged_perceptron_tagger and extracted it to ~/nltk_data/taggers/. But then I get new errors.

Traceback (most recent call last):
  File "/home/james/projects/privateGPT/ingest.py", line 50, in load
    doc = UnstructuredEmailLoader.load(self)
  File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/email.py", line 24, in _get_elements
    return partition_email(filename=self.file_path, **self.unstructured_kwargs)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/partition/email.py", line 265, in partition_email
    element.apply(_replace_mime_encodings)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 154, in apply
    cleaned_text = cleaner(cleaned_text)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/cleaners/core.py", line 197, in replace_mime_encodings
    return quopri.decodestring(text.encode()).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 2195: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/james/projects/privateGPT/ingest.py", line 90, in load_single_document
    return loader.load()[0]
  File "/home/james/projects/privateGPT/ingest.py", line 60, in load
    raise type(e)(f"{self.file_path}: {e}") from e
TypeError: function takes exactly 5 arguments (1 given)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/james/projects/privateGPT/ingest.py", line 168, in <module>
    main()
  File "/home/james/projects/privateGPT/ingest.py", line 158, in main
    texts = process_documents()
  File "/home/james/projects/privateGPT/ingest.py", line 120, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/home/james/projects/privateGPT/ingest.py", line 109, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: function takes exactly 5 arguments (1 given)
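
The closing TypeError: function takes exactly 5 arguments (1 given) is a secondary bug: ingest.py re-raises with raise type(e)(f"{self.file_path}: {e}"), which fails when type(e) is UnicodeDecodeError, whose constructor requires five arguments (encoding, object, start, end, reason). A minimal sketch of the failure and one possible workaround (reraise_with_path is a hypothetical helper, not privateGPT's actual code):

```python
def reraise_with_path(e, path):
    # ingest.py's pattern: rebuild the exception with the file path prepended.
    # Constructing UnicodeDecodeError from a single string raises
    # "TypeError: function takes exactly 5 arguments (1 given)".
    try:
        raise type(e)(f"{path}: {e}") from e
    except TypeError:
        # Fallback: wrap in a plain ValueError so the file path still survives.
        raise ValueError(f"{path}: {e}") from e

try:
    b"\xca".decode("utf-8")  # 0xca starts a multi-byte sequence that never ends
except UnicodeDecodeError as exc:
    try:
        reraise_with_path(exc, "source_documents/mail.eml")
    except ValueError as wrapped:
        print(wrapped)  # the path plus the original decode error message
```

With the original pattern, the TypeError from the failed constructor masks the real UnicodeDecodeError, which is why the traceback above is so confusing.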

ElementalWarrior avatar May 22 '23 05:05 ElementalWarrior

Well, thanks, that helped me understand that I needed to remove the nltk_data folder, which solved the BadZipFile issue, but now I have exactly the same TypeError: function takes exactly 5 arguments (1 given).

slavag avatar May 22 '23 09:05 slavag

Just to report I'm seeing the same issue. Thanks for looking into this.

kulnor avatar May 22 '23 20:05 kulnor

Getting the same issue with any file that is not a .pdf

ericflecher avatar May 23 '23 01:05 ericflecher

I linked my folder to source_documents on my Linux machine and got the same issue as well

BacKinnn avatar May 23 '23 10:05 BacKinnn

Having the same issue. I downloaded nltk_data again and now I have this error:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|                            | 0/603 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/pypandoc/__init__.py", line 164, in convert_file
    format = _identify_format_from_path(discovered_source_files[0], format)
                                        ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range 

conradolandia avatar May 23 '23 17:05 conradolandia

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....

kulnor avatar May 24 '23 04:05 kulnor

This is thrown for ~/nltk_data/taggers/averaged_perceptron_tagger.zip, which is really a bad zip. Deleting ~/nltk_data and restarting the ingestion downloaded a correct version of this file, and now ingestion works for me.
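
The delete-and-redownload fix can be scripted. A minimal sketch, assuming the default ~/nltk_data location and the two packages this thread identifies; refresh_nltk_data is a hypothetical helper, not part of privateGPT:

```python
import shutil
from pathlib import Path

def refresh_nltk_data(packages=("punkt", "averaged_perceptron_tagger")):
    """Delete the local NLTK data directory, then re-download the packages."""
    import nltk  # deferred so the helper fails loudly only when actually used

    nltk_data = Path.home() / "nltk_data"
    if nltk_data.exists():
        shutil.rmtree(nltk_data)  # removes any corrupted archives
    for package in packages:
        nltk.download(package)  # fetches a fresh, valid copy
```

Calling refresh_nltk_data() before re-running ingest.py should leave the tagger and tokenizer data in a good state. It needs network access, and it wipes the whole ~/nltk_data directory, so any other locally installed NLTK data would have to be re-downloaded on demand.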

adi avatar May 24 '23 07:05 adi

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....

Thank you, it worked. Although I didn't need to install punkt; I needed to update a different package/model (I don't remember what it was called), and then the command ran.

Currently ingesting my data; it's stuck on the line "Using embedded DuckDB with persistence: data will be stored in: db" (but my CPU is still busy, so I guess it's fine?). Will update with the result if everything goes smoothly.

BacKinnn avatar May 24 '23 07:05 BacKinnn

Solved it with the suggestions offered here. Also found out that if you try to ingest too many documents at once, it chokes. Feeding around 20-30 documents works fine; around 50 still works but gets very slow. Also, some PDFs may fail to be ingested.

conradolandia avatar May 25 '23 01:05 conradolandia

Can someone please translate this to idiot-speak for me?

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....

tfyt2023 avatar May 25 '23 04:05 tfyt2023

This is thrown for ~/nltk_data/taggers/averaged_perceptron_tagger.zip, which is really a bad zip. Deleting ~/nltk_data and restarting the ingestion downloaded a correct version of this file, and now ingestion works for me.

That worked for me. Thank you @adi!

ThomasFeher avatar May 25 '23 06:05 ThomasFeher

First, Delete ~/nltk_data/taggers/averaged_perceptron_tagger.zip; Second,

import nltk
nltk.download()

Choose averaged_perceptron_tagger to download.

cutd avatar May 25 '23 08:05 cutd

I am able to get mine to run, but trying to process emails, there are tonnes of issues with unicode, date parsing, etc. for emails exported from Thunderbird.

ElementalWarrior avatar May 25 '23 15:05 ElementalWarrior

If like me you have a broken TK install, you can also force the NLTK download from the CLI:

python -m nltk.downloader all

DaniruKun avatar May 26 '23 12:05 DaniruKun

@ElementalWarrior same here, unicode issues, exported from Gmail, and eventually the process fails. I opened another bug, but there's no answer there: https://github.com/imartinez/privateGPT/issues/378 and also opened a ticket in unstructured, but no one answers there either: https://github.com/Unstructured-IO/unstructured/issues/635

slavag avatar May 27 '23 08:05 slavag

If like me you have a broken TK install, you can also force the NLTK download from the CLI:

python -m nltk.downloader all

This fixed it for me, thanks!

DukeOfEtiquette avatar Jun 01 '23 22:06 DukeOfEtiquette

Same issue on my M1 laptop, this did it for me.

python3
import nltk
nltk.download()

# Select Download menu
d
# Enter identifier
averaged_perceptron_tagger

# Select Download menu
d
# Enter identifier
punkt

I guess the download-all approach is easier and works too, but it's unnecessary.

Jolg42 avatar Jun 04 '23 21:06 Jolg42

How did you solve it? Can you package it directly into a Python script to run?

Friedrich-hue avatar Jul 02 '23 09:07 Friedrich-hue

First, Delete ~/nltk_data/taggers/averaged_perceptron_tagger.zip; Second,

import nltk
nltk.download()

Choose averaged_perceptron_tagger to download.

This solved my error on chatdocs as well. Thanks!!

williamblair333 avatar Sep 04 '23 16:09 williamblair333

Type 'python' to get to the interactive prompt, then run:

import nltk
nltk.download()

mweth avatar Sep 07 '23 03:09 mweth