License

Open djsutherland opened this issue 7 years ago • 7 comments

Can you clarify what license the nltk_data files are under? Is it the same license as nltk? Do the various data files have different licenses? conda-forge would like to begin packaging nltk_data, because a few users have requested it (to make installing more uniform / track versioning / etc; https://github.com/conda-forge/staged-recipes/pull/4463), but we'd need to know the license first.

djsutherland avatar Nov 27 '17 02:11 djsutherland

The different resources in nltk_data come under different licenses. The licenses of the individual resources in nltk_data should be safe for re-distribution.

It would be great to package nltk_data. Would it be a pip-installable data library?

alvations avatar Nov 27 '17 03:11 alvations

It wouldn't be in pip, but you could get it with conda install nltk_data (assuming you've set up conda-forge: https://conda-forge.org).

I see now that the xml files specify the licenses of the data files. I guess the question is what license the xml files themselves have...they're so small that I doubt it really matters, but still not technically specified. Anyway, I guess we'll just say "License: Various" or whatever, still need to figure that out amongst ourselves though.
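As an illustration of reading license information out of such XML metadata, here is a minimal sketch. It assumes each resource is described by a `package` element carrying `id` and `license` attributes. The sample document, its package names, and the helper functions are hypothetical and only stand in for the real nltk_data files, whose exact layout should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample metadata; the real nltk_data XML may use a different layout.
SAMPLE_INDEX = """
<packages>
  <package id="example_corpus" license="Creative Commons Attribution 4.0" />
  <package id="example_tagger" license="" />
  <package id="example_tokenizer" />
</packages>
"""


def licenses_by_package(xml_text):
    """Map each package id to its license string ('' if absent or blank)."""
    root = ET.fromstring(xml_text)
    return {
        pkg.get("id"): (pkg.get("license") or "").strip()
        for pkg in root.iter("package")
    }


def packages_missing_license(xml_text):
    """Return the sorted ids of packages whose license is missing or blank."""
    licenses = licenses_by_package(xml_text)
    return sorted(pkg_id for pkg_id, lic in licenses.items() if not lic)


if __name__ == "__main__":
    for pkg_id in packages_missing_license(SAMPLE_INDEX):
        print(f"{pkg_id}: no license specified")
```

A packager could run something along these lines over the metadata to separate resources with a stated license from those that still need manual review.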

djsutherland avatar Nov 27 '17 18:11 djsutherland

One of our NLP projects is completely dependent on the NLTK tokenizer and POS tagger. But recently we figured out that the tokenizer and POS tagger models do not have a license, and hence we are not able to use them in our project. Is it possible to add a license for those two models? Are there any other open-source tokenizer and POS tagger models available on the net?

saswata64900 avatar Nov 20 '18 07:11 saswata64900

This remains a problem for distributions packaging nltk. Looking at https://www.nltk.org/nltk_data/, many of the entries have a blank licence/copyright field.

Would it be possible for nltk to construct a free/libre dataset which can be safely redistributed? Thanks.

thesamesam avatar Nov 23 '22 04:11 thesamesam

Many of the NLTK data resources themselves contain licensing, copyright or README files that contain additional information on to what extent the data may be distributed. Perhaps that will help somewhat.

tomaarsen avatar Dec 06 '22 13:12 tomaarsen

I did end up untarring the whole lot and taking a look, but many of them either had no README (etc.) or, if they did have one, indicated they were proprietary.

thesamesam avatar Dec 06 '22 13:12 thesamesam

For the record, I'm removing NLTK from Gentoo because of this. IANAL but it looks like many of the corpora shouldn't be redistributed as part of nltk_data in the first place, and letting NLTK download them puts users at risk of copyright violation.

mgorny avatar Dec 16 '22 05:12 mgorny