Eric Kafe

148 comments by Eric Kafe

@naktinis, in the current NLTK version 3.9.1, the tokenizer no longer uses pickles, so the data package to download is now _punkt_tab_. So your test script in #3248 first needs...

Thanks @naktinis! Indeed, in case the parallel processes are launched directly from the OS, it would be safer to write the files atomically. Or maybe some robust filelock can be...
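To make the file-lock idea concrete, here is a minimal sketch using only the standard library. It is not NLTK code, and it is not the "robust" lock the comment asks for: the name `simple_file_lock`, the `.lock` path and the timeout values are all hypothetical, and a lock file left behind by a crashed process would block other workers until the timeout expires.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def simple_file_lock(lock_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock file while the body of the `with` block runs.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock so that waiting processes can proceed.
        os.close(fd)
        os.remove(lock_path)
```

A worker would wrap its download call in `with simple_file_lock(...)`, so that concurrent workers wait instead of writing the same file at once.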

@naktinis, in which realistic use case would you want to download the same package multiple times at once to the same file location?

> When calling nltk.download followed by model...

@naktinis, it sounds like your problem is not specific to nltk: any software that your workers install would meet race conditions. Solving the problem in nltk would not solve it...

> Hugging Face model download logic also uses the "download to a temporary file and move to destination" strategy

Ok, it seems prudent to also use this approach in nltk....
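The "download to a temporary file and move to destination" strategy can be sketched as follows. This is an illustration under assumptions, not the NLTK or Hugging Face implementation: `atomic_write` is a hypothetical helper, and the payload is passed in as bytes rather than fetched from a server.

```python
import os
import tempfile


def atomic_write(dest_path, data):
    """Write `data` (bytes) to `dest_path` so readers never see a partial file.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temporary file in the destination directory so that the
    # final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the complete file into place in one step,
        # overwriting any existing destination.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial file if anything went wrong.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If several workers run this concurrently for the same destination, each writes its own temporary file, and the last rename wins with a complete file.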

I think that this issue deserves a high priority. In the present situation, some distributed clusters are in reality performing an attack on the download servers. NLTK should provide a...

@Dunedan, what would you suggest? Maybe print a warning and ask for confirmation before loading a pickle? The pickles in question contain Python classes with executable functions. They are not...
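A rough sketch of the "print a warning and ask for confirmation" idea, using only the standard library. The function `load_pickle_with_confirmation` and its `confirm` parameter are hypothetical and do not exist in NLTK; the prompt is injectable so that non-interactive callers can supply their own policy.

```python
import pickle
import warnings


def load_pickle_with_confirmation(path, confirm=input):
    """Warn, then load a pickle only if the user explicitly agrees.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    warnings.warn(
        f"{path} is a pickle; loading it can execute arbitrary code.",
        stacklevel=2,
    )
    answer = confirm(f"Load pickle {path}? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        raise PermissionError(f"refused to load pickle: {path}")
    with open(path, "rb") as f:
        return pickle.load(f)
```

A prompt like this only shifts the decision to the user; it does not make the pickle itself any safer to load.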

Thanks @dunedan for your detailed suggestions. Should this issue be labelled critical? I wonder if it could cause NLTK to be deemed unsafe for use in schools, or inclusion in some software...

In addition to the two packages already mentioned, the following also contain pickles:

- chunkers/maxent_ne_chunker.zip
- help/tagsets.zip
- taggers/maxent_treebank_pos_tagger.zip

In total nltk_data contains 52 pickles, where half are Python 2,...

```
#!/bin/sh
# List every pickle stored inside the zip archives under ~/nltk_data
for f in ~/nltk_data/*/*zip
do
  unzip -l "$f" | grep -i pickle >> Nltk_Pickles.txt
done
```

[Nltk_Pickles.txt](https://github.com/user-attachments/files/16054731/Nltk_Pickles.txt)
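For readers without `unzip` and `grep`, roughly the same scan can be written in Python with the standard `zipfile` module. This is a sketch, not the script that produced the attached file: `list_pickles` is a hypothetical name, and it matches `*.zip` rather than the shell's looser `*zip` pattern.

```python
import glob
import os
import zipfile


def list_pickles(data_dir="~/nltk_data"):
    """Return (zip_path, member_name) pairs for members whose name contains 'pickle'.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    found = []
    pattern = os.path.join(os.path.expanduser(data_dir), "*", "*.zip")
    for zip_path in sorted(glob.glob(pattern)):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                # Case-insensitive match, like `grep -i pickle`.
                if "pickle" in name.lower():
                    found.append((zip_path, name))
    return found


if __name__ == "__main__":
    for zip_path, name in list_pickles():
        print(f"{zip_path}: {name}")
```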