private-gpt
ingest.py on eml throws zipfile.BadZipFile: File is not a zip file
Describe the bug and how to reproduce it
Running ingest.py on a source_documents folder containing many .eml files throws zipfile.BadZipFile: File is not a zip file.
Expected behavior
The .eml files should be loaded successfully.
Environment (please complete the following information):
- macOS 12.6 / M1
- Python version 3.11.3
Additional context
Loading new documents: 0%| | 0/75093 [00:08<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 49, in load
doc = UnstructuredEmailLoader.load(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/email.py", line 22, in _get_elements
from unstructured.partition.email import partition_email
File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/email.py", line 41, in
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 59, in load
raise type(e)(f"{self.file_path}: {e}") from e
zipfile.BadZipFile: source_documents/2013-01-03 095102 dea8d7fd13.eml: File is not a zip file
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 167, in
Same here. I only have txt, html and pdf files. And I get that zip error.
It's happening at import time. This is the cause: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/nlp/tokenize.py#LL31C39-L31C52. When I try to unzip the downloaded file myself, it fails. I found in the nltk_data repo that you can do:
import nltk
nltk.download()
which led me to https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. I downloaded averaged_perceptron_tagger and unzipped it to ~/nltk_data/taggers/. But then I get new errors.
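If anyone wants to confirm the cached archive is the culprit before deleting anything, here is a quick check (a sketch, assuming the default ~/nltk_data location):
import os
import zipfile

# Default path NLTK uses for the tagger archive
tagger_zip = os.path.expanduser("~/nltk_data/taggers/averaged_perceptron_tagger.zip")
if os.path.exists(tagger_zip):
    # False means the cached download is corrupted and should be deleted
    print(zipfile.is_zipfile(tagger_zip))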
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/james/projects/privateGPT/ingest.py", line 50, in load
doc = UnstructuredEmailLoader.load(self)
File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
elements = self._get_elements()
File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/email.py", line 24, in _get_elements
return partition_email(filename=self.file_path, **self.unstructured_kwargs)
File "/home/james/.local/lib/python3.10/site-packages/unstructured/partition/email.py", line 265, in partition_email
element.apply(_replace_mime_encodings)
File "/home/james/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 154, in apply
cleaned_text = cleaner(cleaned_text)
File "/home/james/.local/lib/python3.10/site-packages/unstructured/cleaners/core.py", line 197, in replace_mime_encodings
return quopri.decodestring(text.encode()).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 2195: invalid continuation byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/james/projects/privateGPT/ingest.py", line 90, in load_single_document
return loader.load()[0]
File "/home/james/projects/privateGPT/ingest.py", line 60, in load
raise type(e)(f"{self.file_path}: {e}") from e
TypeError: function takes exactly 5 arguments (1 given)
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/james/projects/privateGPT/ingest.py", line 168, in <module>
main()
File "/home/james/projects/privateGPT/ingest.py", line 158, in main
texts = process_documents()
File "/home/james/projects/privateGPT/ingest.py", line 120, in process_documents
documents = load_documents(source_directory, ignored_files)
File "/home/james/projects/privateGPT/ingest.py", line 109, in load_documents
for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
TypeError: function takes exactly 5 arguments (1 given)
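For what it's worth, the TypeError here is a secondary failure in ingest.py's error wrapping, not the real problem. ingest.py re-raises with raise type(e)(f"{self.file_path}: {e}") from e, which works for most exception classes, but UnicodeDecodeError's constructor requires five arguments (encoding, object, start, end, reason), so rebuilding it from a single message string itself fails and masks the original decode error. A minimal reproduction (file name made up):
# Reproduces the masked-exception pattern from the traceback above
try:
    b"\xca".decode("utf-8")  # stand-in for the failing MIME decode
except UnicodeDecodeError as e:
    # Same wrapping ingest.py does; this raises
    # "TypeError: function takes exactly 5 arguments (1 given)"
    raise type(e)(f"some_file.eml: {e}") from e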
Thanks, that helped me understand that I needed to remove the nltk_data folder, which solved the BadZipFile issue, but now I have exactly the same issue with TypeError: function takes exactly 5 arguments (1 given).
Just to report I'm seeing the same issue. Thanks for looking into this.
Getting the same issue with any file that is not a .pdf
I linked my folder to source_documents on my Linux machine and got the same issue as well.
Having the same issue. I downloaded nltk_data again and now I have this error:
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 0%| | 0/603 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
return convert_and_partition_html(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
html_text = convert_file_to_text(
^^^^^^^^^^^^^^^^^^^^^
File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
text = pypandoc.convert_file(filename, target_format, format=source_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/andi/.local/lib/python3.11/site-packages/pypandoc/__init__.py", line 164, in convert_file
format = _identify_format_from_path(discovered_source_files[0], format)
~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 167, in <module>
main()
File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 157, in main
texts = process_documents()
^^^^^^^^^^^^^^^^^^^
File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 119, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 108, in load_documents
for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
IndexError: list index out of range
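A hedged guess at what's going on with the epub failure: recent pypandoc versions interpret a string filename as a glob pattern, so an epub whose name contains glob metacharacters (e.g. square brackets) matches nothing and discovered_source_files comes back empty, giving exactly this IndexError. A quick way to spot such files:
import glob
from pathlib import Path

# List epubs whose names defeat the glob lookup pypandoc does internally
for f in Path("source_documents").rglob("*.epub"):
    if not glob.glob(str(f)):
        print("problem filename:", f)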
As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html
In a nutshell run Python in interactive mode and call:
import nltk
nltk.download()
This opens the install dialog. Go to the Models tab and install punkt.
Hope this helps and works for others....
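If the Tk dialog won't open on your machine, the same install also works non-interactively, with the package named explicitly:
import nltk
nltk.download("punkt")  # same package the Models tab installs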
This is thrown for ~/nltk_data/taggers/averaged_perceptron_tagger.zip, which really is a bad zip. Deleting ~/nltk_data and restarting ingestion downloaded a correct version of this file, and now ingestion works for me.
Thank you, it worked, although I didn't need to install punkt; I needed to update a different package/model (I don't remember what it was called), and then the command ran.
Currently ingesting my data; it's stuck on the line "Using embedded DuckDB with persistence: data will be stored in: db" (but my CPU is still busy, so I guess it's fine?). Will update with the result if everything goes smoothly.
Solved it with the suggestions offered here. Also found out that if you try to ingest too many documents at once, it chokes; feeding it in batches as sketched below helps. Around 20-30 documents works fine. Around 50 also works, but gets very slow. Also, some PDFs may fail to be ingested.
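A rough sketch of that batching idea: move documents into source_documents in small chunks and re-run ingest.py between chunks, relying on ingest.py skipping files it has already ingested (the ignored_files logic visible in the tracebacks above). The staging_documents folder is hypothetical; adjust paths to taste.
# Hypothetical batching helper: feeds ~25 documents per ingest run
import shutil
import subprocess
from pathlib import Path

staging = Path("staging_documents")  # hypothetical folder holding all documents
source = Path("source_documents")
batch_size = 25

files = sorted(p for p in staging.iterdir() if p.is_file())
for start in range(0, len(files), batch_size):
    for f in files[start:start + batch_size]:
        shutil.move(str(f), source / f.name)
    subprocess.run(["python", "ingest.py"], check=True)  # ingests the new batch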
Can someone please translate this to idiot-speak for me?
Deleting ~/nltk_data and restarting ingestion worked for me. Thank you @adi!
First, delete ~/nltk_data/taggers/averaged_perceptron_tagger.zip; second, run:
import nltk
nltk.download()
and choose averaged_perceptron_tagger to download.
I am able to get mine to run, but when trying to process emails exported from Thunderbird there are tonnes of issues with Unicode, date parsing, etc.
If like me you have a broken Tk install, you can also force the NLTK download from the CLI:
python -m nltk.downloader all
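If downloading everything feels like overkill, the same downloader also accepts specific package identifiers:
python -m nltk.downloader punkt averaged_perceptron_tagger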
@ElementalWarrior same here: Unicode issues with emails exported from Gmail, and eventually the process fails. I opened another bug, but there's no answer there yet: https://github.com/imartinez/privateGPT/issues/378, and also opened a ticket in unstructured, but no one has answered there either: https://github.com/Unstructured-IO/unstructured/issues/635
python -m nltk.downloader all fixed it for me, thanks!
Same issue on my M1 laptop, this did it for me.
python3
import nltk
nltk.download()
# Select Download menu
d
# Enter identifier
averaged_perceptron_tagger
# Select Download menu
d
# Enter identifier
punkt
I guess the "download all" approach is easier and works too, but it's unnecessary.
How did you solve it? Can it be packaged directly into a Python script to run?
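Something like this should work as a one-shot script (a sketch of the delete-and-re-download fix from earlier in the thread, assuming the default ~/nltk_data location):
# fix_nltk_tagger.py: delete the bad archive, then re-download it
import os
import nltk

tagger_zip = os.path.expanduser("~/nltk_data/taggers/averaged_perceptron_tagger.zip")
if os.path.exists(tagger_zip):
    os.remove(tagger_zip)  # step 1: remove the corrupted zip
nltk.download("averaged_perceptron_tagger")  # step 2: fetch a fresh copy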
The delete-and-re-download fix solved my error on chatdocs as well. Thanks!!
Type 'python' to get to the interactive prompt, then run:
import nltk
nltk.download()