
bug/<raise BadZipFile>

Open Vampxgg opened this issue 11 months ago • 8 comments

Describe the bug

No matter how I call the library, it fails with zipfile.BadZipFile: File is not a zip file.


Environment Info

    Traceback (most recent call last):
      File "D:\pythonprojects\LANGCHAIN\main.py", line 87, in <module>
        elements = partition_pdf("D:\pythonprojects\LANGCHAIN\inputs\智能传感器装配调试台架-产品手册.pdf")
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\documents\elements.py", line 581, in wrapper
        elements = func(*args, **kwargs)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\file_utils\filetype.py", line 725, in wrapper
        elements = func(*args, **kwargs)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\file_utils\filetype.py", line 683, in wrapper
        elements = func(*args, **kwargs)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
        elements = func(*args, **kwargs)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 209, in partition_pdf
        return partition_pdf_or_image(
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 350, in partition_pdf_or_image
        out_elements = _process_uncategorized_text_elements(elements)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\pdf.py", line 930, in _process_uncategorized_text_elements
        new_el = element_from_text(cast(Text, el).text)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text.py", line 149, in element_from_text
        elif is_possible_narrative_text(text):
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 74, in is_possible_narrative_text
        if exceeds_cap_ratio(text, threshold=cap_threshold):
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 270, in exceeds_cap_ratio
        if sentence_count(text, 3) > 1:
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\partition\text_type.py", line 219, in sentence_count
        sentences = sent_tokenize(text)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 56, in sent_tokenize
        _download_nltk_packages_if_not_present()
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 41, in _download_nltk_packages_if_not_present
        tagger_available = check_for_nltk_package(
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\unstructured\nlp\tokenize.py", line 29, in check_for_nltk_package
        nltk.find(f"{package_category}/{package_name}", paths=paths)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 551, in find
        return find(modified_name, paths)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 538, in find
        return ZipFilePathPointer(p, zipentry)
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 391, in __init__
        zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
      File "D:\miniconda\envs\LANGCHAIN\lib\site-packages\nltk\data.py", line 1020, in __init__
        zipfile.ZipFile.__init__(self, filename)
      File "D:\miniconda\envs\LANGCHAIN\lib\zipfile.py", line 1268, in __init__
        self._RealGetContents()
      File "D:\miniconda\envs\LANGCHAIN\lib\zipfile.py", line 1335, in _RealGetContents
        raise BadZipFile("File is not a zip file")
    zipfile.BadZipFile: File is not a zip file

Vampxgg, Jan 15 '25 09:01

Same here, how did you fix it?

kiddoray, Mar 28 '25 07:03

Same here, how did you fix it?

I haven't found a good solution so far. I hope the maintainers can address it soon. If you have any ideas, please share them!

Vampxgg, Mar 28 '25 09:03

same issue for me now. just wanted to follow this thread for resolution

abbyDC, Apr 07 '25 07:04

Same issue with version 0.17.2.

Junon-Gz, May 15 '25 03:05

Hey everyone, I found a solution to an issue I encountered. I'm pretty new to this project and ran into the same problem when running the examples. After some investigation, I realized the issue was caused by missing nltk_data.

Here's how I solved it:

  1. Manually download nltk_data to a specified folder:

    import nltk
    nltk.download(download_dir=nltk_data_path)  # no package id given: opens NLTK's interactive downloader
    
  2. Set the environment variable for NLTK data:

    NLTK_DATA = nltk_data_path
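
The two steps above can be sketched in Python as follows. This is a minimal sketch, not the maintainers' recommended fix: set_nltk_data_dir and download_all are hypothetical helper names, and the folder you pass in is your own choice. It assumes NLTK reads the NLTK_DATA environment variable when it is first imported, so the variable is set before the import.

```python
import os
from pathlib import Path


def set_nltk_data_dir(target_dir: str) -> str:
    """Create target_dir if needed and point NLTK_DATA at it.

    Call this before the first `import nltk` in the process; if nltk is
    already imported, append the folder to nltk.data.path instead.
    """
    path = Path(target_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = str(path)
    return str(path)


def download_all(target_dir: str) -> bool:
    """Download every NLTK package into target_dir (large; needs network)."""
    import nltk  # imported late so it sees the NLTK_DATA value set above

    return nltk.download("all", download_dir=target_dir)


# Example usage (hypothetical folder name):
#   data_dir = set_nltk_data_dir("~/nltk_data_local")
#   download_all(data_dir)
```

Setting the variable in the shell or system settings before launching Python has the same effect and avoids the import-order concern.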
    

I'm still learning why tokenization is used here, but when I printed nltk.data.path, I saw there were many paths involved — so I'm digging deeper into this powerful project.

Hope this helps you too!

baijiu-in-my-cup, Jun 12 '25 11:06

@KaLe-Baijiu Did you download all NLTK packages? I tried your solution, but it didn't work.

raykin, Jul 04 '25 08:07

@KaLe-Baijiu Did you download all NLTK packages? I tried your solution, but it didn't work.

@raykin Yes, I downloaded everything.

I checked the source code, and it seems only averaged_perceptron_tagger_eng and punkt_tab are needed. Could you try that?
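
A rough sketch of fetching only those two packages instead of everything. The package names come from the comment above; the category folders (taggers, tokenizers) follow the usual nltk_data layout and are an assumption about what this unstructured version looks up. missing_packages and download_missing are hypothetical helper names.

```python
import os
from typing import Dict, List

# Package name -> category folder inside an nltk_data directory.
# Assumption: standard NLTK layout, e.g. <data_dir>/tokenizers/punkt_tab.
REQUIRED_PACKAGES: Dict[str, str] = {
    "averaged_perceptron_tagger_eng": "taggers",
    "punkt_tab": "tokenizers",
}


def missing_packages(data_dir: str) -> List[str]:
    """Return required packages with neither a folder nor a .zip in data_dir."""
    missing = []
    for name, category in REQUIRED_PACKAGES.items():
        base = os.path.join(data_dir, category, name)
        if not (os.path.isdir(base) or os.path.isfile(base + ".zip")):
            missing.append(name)
    return missing


def download_missing(data_dir: str) -> List[str]:
    """Download what missing_packages() reports; return names that failed."""
    import nltk  # imported lazily so the check above works without nltk

    failed = []
    for name in missing_packages(data_dir):
        # nltk.download returns False when a package could not be fetched.
        if not nltk.download(name, download_dir=data_dir, quiet=True):
            failed.append(name)
    return failed
```

Note that missing_packages only checks that something exists at the expected location; it does not verify that an existing .zip is intact.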


baijiu-in-my-cup, Jul 04 '25 11:07

@KaLe-Baijiu Well, I hit this error in another project that relies on unstructuredIO. The error is tricky and confusing, and it finally led me here. I'll try downloading all the data later. I'm curious whether the source code can auto-download the data before it raises the zip error. My local network is sometimes blocked, so I'm not sure whether the auto-download was blocked.

Update: by the way, I think the error message is very bad. It should say which file it is looking for, with the exact name. In my case, I upload a PDF to a Python app for processing, and it raises 'File is not a zip'. The app's developers then assume I uploaded a zip file, and because I work remotely I can't easily explain otherwise. They think I mistook a zip for a PDF, which is ridiculous.
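
The traceback in the original report ends inside NLTK while it opens a .zip under one of its data directories, so one plausible cause is a truncated or blocked download rather than anything wrong with the PDF. A minimal sketch, assuming that is the cause, that names the exact files Python's zipfile module rejects. find_bad_zips is a hypothetical helper, not part of nltk or unstructured.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def find_bad_zips(data_dirs: Iterable[str]) -> List[Path]:
    """Return every *.zip under the given folders that is not a valid zip."""
    bad: List[Path] = []
    for folder in data_dirs:
        root = Path(folder)
        if not root.is_dir():
            continue  # NLTK's search path usually lists folders that do not exist
        for candidate in sorted(root.rglob("*.zip")):
            # is_zipfile() checks for the zip end-of-central-directory record,
            # which a truncated or HTML-error-page download will lack.
            if not zipfile.is_zipfile(str(candidate)):
                bad.append(candidate)
    return bad


# Example usage, with nltk installed:
#   import nltk
#   for path in find_bad_zips(nltk.data.path):
#       print("corrupt NLTK archive:", path)
```

If this reports a file, deleting it and downloading that package again is a reasonable next step.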

raykin, Jul 05 '25 05:07