concerning the destination path for `download_corpora`

Open zackmdavis opened this issue 9 years ago • 1 comments

Unsure if it was the tool I wanted for my task at hand, I installed TextBlob in a virtualenv, tried to call `.sentences` on a `TextBlob`, and it failed with a message saying that I needed to run `python -m textblob.download_corpora` to download additional data needed for this feature. I did so, and was somewhat surprised and disappointed to find that it created a non-dotfile directory in my `$HOME`; I would have expected and wanted it to stay within the virtualenv. It looks like we're calling a download function from `nltk`, which does seem to have a `download_dir` kwarg, so downloading the corpora to a less-intrusive place by default (or making the destination directory prominently configurable) seems like a plausibly feasible user-experience enhancement.

(Sorry, I feel guilty about filing an issue without a patch, but ...)

Jan 24 '16 05:01 zackmdavis

The documented method is to set the `NLTK_DATA` environment variable. However, this suggests that the NLTK downloader does not honour that setting; it is only used to find the corpus at runtime, if you managed to get the data there another way.

I'm in favor of:

Exposing the `download_dir` kwarg, but leaving the defaults as they are for consistency with NLTK.
Improving the docs to clarify how to download to a custom directory (the `download_dir` kwarg) and how to then use it (set the `NLTK_DATA` environment variable).

It's not pretty. Open to suggestions on a better approach.
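As a rough illustration of the two steps above (download with `download_dir`, then point `NLTK_DATA` at the same place), here is a minimal sketch that calls NLTK directly rather than going through TextBlob. The helper names `venv_nltk_data_dir` and `download_to`, the `nltk_data` directory under `sys.prefix`, and the `"punkt"` package id are assumptions made for the example, not part of TextBlob's API.

```python
import os
import sys
from pathlib import Path


def venv_nltk_data_dir(prefix=None):
    """Return a data directory under the active environment's prefix.

    Hypothetical helper: inside a virtualenv, sys.prefix points at the
    environment, so the data stays out of $HOME.
    """
    base = Path(prefix if prefix is not None else sys.prefix)
    return base / "nltk_data"


def download_to(data_dir, package_ids=("punkt",)):
    """Download NLTK packages into data_dir and make them findable.

    Hypothetical helper; the package ids are only an example and may not
    match the set that TextBlob needs.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # NLTK_DATA is consulted when looking data up, so set it for this
    # process (and any child processes) before nltk is first imported.
    os.environ["NLTK_DATA"] = str(data_dir)

    import nltk  # imported lazily so the path helper works without nltk

    # If nltk was already imported elsewhere, extend the search path too.
    if str(data_dir) not in nltk.data.path:
        nltk.data.path.insert(0, str(data_dir))

    for package_id in package_ids:
        # download_dir tells the downloader where to put the files.
        nltk.download(package_id, download_dir=str(data_dir))
    return data_dir
```

Calling `download_to(venv_nltk_data_dir())` from inside the virtualenv would fetch the data into the environment. Later sessions would still need `NLTK_DATA` set to that directory (for example in the shell that activates the environment) unless the directory happens to be on NLTK's default search path.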

Jan 03 '18 01:01 jschnurr