Describe the bug
No matter how I call partition_pdf, it fails with the following error:
zipfile.BadZipFile: File is not a zip file
Environment Info
Traceback (most recent call last):
File "D:\pythonprojects\LANGCHAIN\main.py", line 87, in <module>
elements = partition_pdf("D:\pythonprojects\LANGCHAIN\inputs\智能传感器装配调试台架-产品手册.pdf")
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\documents\elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\file_utils\filetype.py", line 725, in wrapper
elements = func(*args, **kwargs)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\file_utils\filetype.py", line 683, in wrapper
elements = func(*args, **kwargs)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 209, in partition_pdf
return partition_pdf_or_image(
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 350, in partition_pdf_or_image
out_elements = _process_uncategorized_text_elements(elements)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 930, in _process_uncategorized_text_elements
new_el = element_from_text(cast(Text, el).text)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text.py", line 149, in element_from_text
elif is_possible_narrative_text(text):
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 74, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 270, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 219, in sentence_count
sentences = sent_tokenize(text)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 56, in sent_tokenize
_download_nltk_packages_if_not_present()
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 41, in _download_nltk_packages_if_not_present
tagger_available = check_for_nltk_package(
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 29, in check_for_nltk_package
nltk.find(f"{package_category}/{package_name}", paths=paths)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 551, in find
return find(modified_name, paths)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 538, in find
return ZipFilePathPointer(p, zipentry)
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 391, in __init__
zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 1020, in __init__
zipfile.ZipFile.__init__(self, filename)
File "D:\miniconda\envs\LANGCHAIN\lib\zipfile.py", line 1268, in __init__
self._RealGetContents()
File "D:\miniconda\envs\LANGCHAIN\lib\zipfile.py", line 1335, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Same here, how did you fix it?
I haven't found a solution so far. I hope the maintainers can address it soon. If you have any ideas, please share them!
Same issue for me now. Just following this thread for a resolution.
Same issue on version 0.17.2.
Hey everyone, I found a solution to an issue I encountered. I'm pretty new to this project and ran into the same problem when running the examples. After some investigation, I realized the issue was caused by missing nltk_data.
Here's how I solved it:
- Manually download nltk_data to a specified folder:
  nltk.download(download_dir=nltk_data_path)
- Set the environment variable for NLTK data:
  NLTK_DATA = nltk_data_path
I'm still learning why tokenization is used here, but when I printed nltk.data.path I saw that many paths are searched, so I'm digging deeper into this powerful project.
Hope this helps you too!
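The two steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the project's documented setup: the helper names are hypothetical, and passing "all" to nltk.download is an assumption about what "download nltk_data" meant. NLTK reads NLTK_DATA when it is first imported, so the variable is set before the import.

```python
import os
from pathlib import Path


def prepare_nltk_data_dir(nltk_data_path):
    """Create the data folder and expose it to NLTK via NLTK_DATA.

    Hypothetical helper; call it before the first `import nltk`,
    because NLTK builds nltk.data.path from NLTK_DATA at import time.
    """
    target = Path(nltk_data_path).expanduser().resolve()
    target.mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = str(target)
    return target


def download_all_nltk_data(target):
    """Download every NLTK package into `target` (large; needs network)."""
    import nltk  # imported late so the NLTK_DATA value set above is honoured

    # Assumption: "all" is used here to mirror "download nltk_data".
    return nltk.download("all", download_dir=str(target))
```

A possible usage would be `data_dir = prepare_nltk_data_dir(nltk_data_path)` followed by `download_all_nltk_data(data_dir)`, run before anything imports unstructured.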
@KaLe-Baijiu Did you download all the nltk packages? I tried your solution, but it didn't work.
@raykin Yes, I downloaded all of them.
I checked the source code, and it seems that only averaged_perceptron_tagger_eng and punkt_tab are needed. Could you try that?
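To try fetching only those two packages, something like the following might work. It is a sketch: download_required and its downloader parameter are hypothetical names introduced here, and the package list comes from the comment above rather than from verified documentation. force=True asks NLTK to download again even when it believes the package is already installed.

```python
REQUIRED_PACKAGES = ("punkt_tab", "averaged_perceptron_tagger_eng")


def download_required(download_dir, packages=REQUIRED_PACKAGES, downloader=None):
    """Download each package into download_dir; return {name: succeeded}.

    `downloader` is a hypothetical hook for testing or offline use;
    when it is omitted, nltk.download is used.
    """
    if downloader is None:
        import nltk  # only needed when the real downloader is used

        downloader = nltk.download
    results = {}
    for name in packages:
        # nltk.download returns a truthy value on success.
        ok = downloader(name, download_dir=str(download_dir), force=True)
        results[name] = bool(ok)
    return results
```

Any package mapped to False in the returned dictionary did not download, which would be worth checking on a network that is sometimes blocked.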

@KaLe-Baijiu Well, I hit this error in another project that relies on unstructuredIO. The error is tricky and confusing, and it finally led me here.
I'll try downloading all the data later.
I'm curious whether the source code can auto-download the data before it raises the zip error. My local network is sometimes blocked, so I'm not sure whether the auto-download was blocked.
Update: by the way, I think the error message is very bad. It should say exactly which file it is looking for, by name.
In my case, I upload and process a PDF file through a Python app, which then raises this 'File is not a zip' error. The app's devs then assume I uploaded a zip, and I can't even explain myself because I'm remote. They think I made a mistake by passing a zip as a PDF, which is ridiculous.
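Since the traceback ends inside NLTK while it opens a zip archive from its data folders, one way to find out which file is involved is to scan those folders for archives that are not valid zips, for example a download cut short by a network block. A minimal sketch using only the standard library; find_bad_zips is a hypothetical helper, not part of NLTK or unstructured:

```python
import zipfile
from pathlib import Path


def find_bad_zips(data_dirs):
    """Return every *.zip under data_dirs that is not a readable zip archive."""
    bad = []
    for data_dir in data_dirs:
        root = Path(data_dir)
        if not root.is_dir():
            continue  # search paths that do not exist are skipped
        for candidate in sorted(root.rglob("*.zip")):
            # is_zipfile returns False for truncated or non-zip files.
            if not zipfile.is_zipfile(str(candidate)):
                bad.append(candidate)
    return bad
```

Passing nltk.data.path as data_dirs would check the locations NLTK searches. Deleting any file it reports and downloading that package again is one thing to try.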