langchain-tutorials
UnstructuredPDFLoader zipfile.BadZipFile: File is not a zip file
Hi there, I was trying the "Ask A Book Questions" tutorial, but I got stuck on the third line, `data = loader.load()`.
Do you have any idea why it says my document is not a zip file? It is actually loading a PDF.
Here is the stack trace:
Traceback (most recent call last):
  File "/Users/serena/Documents/langchain-tutorials/data_generation/chatPDF.py", line 5, in <module>
    data = loader.load()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/pdf.py", line 27, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/pdf.py", line 19, in <module>
    from unstructured.partition.text import partition_text
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text.py", line 16, in <module>
    from unstructured.partition.text_type import (
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Unstructured gives a ton of people problems. I'm going to edit the code and give people more options.
Thanks for bringing this up. Check the code in a couple of hours and I'll have it up.
Just updated https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Ask%20A%20Book%20Questions.ipynb
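For anyone hitting the same traceback: the exception is raised inside `nltk.find` while it opens a `ZipFilePathPointer`, so the file being rejected is most likely an NLTK data archive on disk rather than the PDF itself. A corrupted or partially downloaded archive would be one plausible cause, though that is an assumption and not something the traceback confirms. Below is a minimal sketch for checking this. The helper name `find_corrupt_zips` is hypothetical, and the fallback directory `~/nltk_data` is an assumed default location.

```python
import os
import zipfile


def find_corrupt_zips(search_dirs):
    """Return sorted paths of .zip files under search_dirs that are not valid zip archives."""
    corrupt = []
    for root_dir in search_dirs:
        # os.walk silently yields nothing for directories that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if name.lower().endswith(".zip"):
                    path = os.path.join(dirpath, name)
                    # is_zipfile checks the archive's magic number / end-of-directory record.
                    if not zipfile.is_zipfile(path):
                        corrupt.append(path)
    return sorted(corrupt)


if __name__ == "__main__":
    try:
        import nltk
        # nltk.data.path is the list of directories NLTK searches for data.
        dirs = list(nltk.data.path)
    except ImportError:
        # Assumed default location when NLTK is not importable.
        dirs = [os.path.expanduser("~/nltk_data")]
    for path in find_corrupt_zips(dirs):
        print("Corrupt archive:", path)
```

If this reports a file, deleting it and downloading the affected package again with `nltk.download(...)` may resolve the error. The traceback does not show which package was being looked up, so the package name has to come from the reported path.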