bug/Unable to download NLTK data
Describe the bug Since the change was made to no longer use nltk.download(), my application cannot download the required NLTK packages. The application is behind a firewall and we are only allowed to make exceptions for specific traffic, and a public S3 bucket is proving difficult to get approved.
I get an error when it attempts to download the packages:
<urlopen error [Errno 104] Connection reset by peer>
To Reproduce Use a partitioner that requires NLTK
Expected behavior NLTK package download doesn't fail
Additional context Perhaps there is a way to include the required NLTK packages or pre-download them before the application is zipped and deployed?
This sounds like an IT problem unrelated to the framework. If you're behind a firewall, how do you expect to download any NLTK packages?
I would recommend you Dockerize and cache the dependencies, building your container somewhere with internet access.
ENV NLTK_DATA=/usr/share/nltk_data
RUN mkdir -p $NLTK_DATA && chmod -R 777 $NLTK_DATA
RUN python -m nltk.downloader -d $NLTK_DATA stopwords punkt averaged_perceptron_tagger
We already have an exception for the NLTK packages as they are downloaded from GitHub, and this exception was already in place to allow certain Python packages and Oryx builds to work.
I'm just saying that someone else may encounter this same issue, as most IT departments won't allow access to an unknown public S3 bucket.
I encountered the following function in unstructured/nlp/tokenize.py:
def _download_nltk_packages_if_not_present():
    """If required NLTK packages are not available, download them."""
    tagger_available = check_for_nltk_package(
        package_category="taggers",
        package_name="averaged_perceptron_tagger_eng",
    )
    tokenizer_available = check_for_nltk_package(
        package_category="tokenizers",
        package_name="punkt_tab",
    )
    if not tokenizer_available or not tagger_available:
        download_nltk_packages()
To address this, I ensured the required NLTK packages averaged_perceptron_tagger_eng and punkt_tab were downloaded in the Dockerfile:
ENV NLTK_DATA=/usr/share/nltk_data
RUN mkdir -p $NLTK_DATA && chmod -R 777 $NLTK_DATA
RUN python3 -m nltk.downloader -d $NLTK_DATA punkt_tab averaged_perceptron_tagger_eng
Additionally, I added the /usr/share/nltk_data path in the project code:
import nltk
nltk.data.path.append("/usr/share/nltk_data")
This solution works for me.
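To catch a missing package at image build time rather than at runtime behind the firewall, a small stdlib-only check along these lines may help. It is a rough sketch, not part of unstructured or NLTK: the function name missing_nltk_packages is hypothetical, and it assumes the usual NLTK data layout, where a package lives at <data_dir>/<category>/<name>/ or as <data_dir>/<category>/<name>.zip.

```python
from pathlib import Path

# The two packages checked by _download_nltk_packages_if_not_present,
# written as (category, name) pairs.
REQUIRED = [
    ("tokenizers", "punkt_tab"),
    ("taggers", "averaged_perceptron_tagger_eng"),
]


def missing_nltk_packages(data_dirs, required=REQUIRED):
    """Return the (category, name) pairs not found under any of data_dirs.

    Hypothetical helper: it only inspects the filesystem and assumes a
    package is either an unpacked directory or a .zip archive inside
    its category folder.
    """
    missing = []
    for category, name in required:
        found = any(
            (Path(d) / category / name).is_dir()
            or (Path(d) / category / f"{name}.zip").is_file()
            for d in data_dirs
        )
        if not found:
            missing.append((category, name))
    return missing
```

Calling missing_nltk_packages(["/usr/share/nltk_data"]) in a build step after the nltk.downloader line, and failing the build when the returned list is non-empty, should surface the problem before the application is deployed.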
Closing as inactive. Please feel free to reopen if you're still having trouble. Some changes since the last comment seem to have remedied this.