Prepare for FOSS compliance
This PR addresses Issue #102 by documenting one possible way to split nltk_data into OSI (Open Source Initiative)-compliant and nonfree parts.
Why use the OSI rather than the FSF definition of free?
The overwhelming majority of major software and data distributors (Linux distros, conda-forge, Homebrew, etc.) use the OSI definition as their primary standard. The FSF definition is important for the free software movement and documentation/content (e.g., GNU, Wikimedia), but is not the baseline for most mainstream software/data distribution channels.
Two markdown files are introduced:
- free_packages_osi.md: Packages with OSI-approved, public domain, or similarly permissive licenses.
- nonfree_packages_osi.md: Packages with more restrictive, ambiguous, or otherwise non-OSI-compliant licenses.
Every effort has been made to classify each package based on available license information, but feedback and corrections are very welcome, especially for any unclear or disputed cases.
Discussion is welcome and encouraged! If you spot anything that should be reviewed or improved, please join the conversation.
The proposed list of free licenses should probably be wider than just the OSI-approved software licenses.
Here's why:
- OSI focuses on Software: The OSI defines "open source" specifically for software.
- Data has other "free" licenses: Many licenses are equally permissive and FOSS-compatible for data, content, or standards, even if not OSI-approved. Examples include:
  - Public Domain (e.g., CC0)
  - Permissive Creative Commons (e.g., CC BY)
  - Specific standards licenses (e.g., Unicode Terms of Use, IETF Trust License, W3C Document License)

These licenses grant essential freedoms (use, modify, redistribute, including commercially) for data.
Crucially, this broader definition of "free" still firmly excludes:
- Non-Commercial (NC) or No Derivatives (ND) licenses.
- "Academic Use Only" or "Research Use Only" restrictions.
- Ambiguous or truly "unknown" licenses (like Punkt's).
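As a rough illustration of the rules above, a keyword-based first pass could look like the sketch below. The function name `classify_license` and both pattern lists are hypothetical; they were not used to produce the lists in this PR, and they are far from complete. Restrictions are checked first, and anything the patterns do not recognize is treated as nonfree until a human reviews it.

```python
import re

# Hypothetical patterns for restrictions that exclude a license from "free".
NONFREE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"non-?commercial",
        r"\bNC\b",
        r"\bND\b",
        r"no ?deriv",
        r"academic use only",
        r"research use only",
    )
]

# Hypothetical patterns for licenses treated as free for software or data.
FREE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\bCC0\b",
        r"public domain",
        r"\bCC[ -]BY\b",
        r"\bMIT\b",
        r"\b[AL]?GPL\b",
        r"\bApache\b",
        r"\bBSD\b",
    )
]


def classify_license(text):
    """Return "free" or "nonfree" for a license statement (illustrative only)."""
    # Missing or explicitly unknown statements are excluded from the free list.
    if text is None or text.strip().lower() in ("", "unknown"):
        return "nonfree"
    # Restrictions win over everything else, e.g. "CC BY-NC" is nonfree.
    if any(p.search(text) for p in NONFREE_PATTERNS):
        return "nonfree"
    if any(p.search(text) for p in FREE_PATTERNS):
        return "free"
    # Unrecognized wording is ambiguous, so it stays nonfree pending review.
    return "nonfree"
```

A pass like this only flags candidates; every result would still need a manual check against the actual license text.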
All packages in nltk_data/index.xml have been audited from a FOSS (Free and Open Source Software) compliance perspective. The categorization covers every package and adds two new files to this pull request:
- free_packages_foss.md: This document lists packages with clear, FOSS-compliant licenses (such as MIT, GPL, CC BY) as well as a new "Rescued Packages" section for those that are widely used and assumed to be free despite ambiguous or unstated licensing terms.
- nonfree_packages_foss.md: This document lists packages that are non-compliant with FOSS principles, either due to explicit restrictions (e.g., non-commercial use only) or highly ambiguous license statements.
These two lists together provide a complete overview of the licensing status for every single package in the NLTK data collection.
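For anyone who wants to cross-check the lists against the index, a minimal sketch of reading license statements from an index file follows. It assumes each package is a `package` element with `id` and `license` attributes; those names, the sample XML, and both helper functions are assumptions for illustration and should be checked against the real nltk_data/index.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real index has more attributes and many more entries.
SAMPLE_INDEX = """<nltk_data>
  <packages>
    <package id="example_corpus_a" license="MIT" />
    <package id="example_corpus_b" license="CC BY-NC 4.0" />
    <package id="example_model_c" />
  </packages>
</nltk_data>"""


def licenses_by_package(xml_text):
    """Map each package id to its license attribute (None when not stated)."""
    root = ET.fromstring(xml_text)
    return {pkg.get("id"): pkg.get("license") for pkg in root.iter("package")}


def packages_without_license(xml_text):
    """Return the sorted ids of packages with no license statement."""
    licenses = licenses_by_package(xml_text)
    return sorted(pid for pid, lic in licenses.items() if not lic)


if __name__ == "__main__":
    for pid, lic in licenses_by_package(SAMPLE_INDEX).items():
        print(pid, "->", lic or "(no license stated)")
```

Packages returned by `packages_without_license` are the ones most likely to need discussion, since an unstated license cannot be classified from the index alone.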
Marking this PR as "Ready for Review" to encourage broader feedback and community input.
While I anticipate some modifications may be necessary, the current state provides a solid foundation for discussion and refinement regarding FOSS compliance. All feedback and suggestions are welcome!