
Prepare for FOSS compliance

Open ekaf opened this issue 6 months ago • 3 comments

This PR is intended to address Issue #102 by documenting a possible way to split nltk_data into OSI (Open Source Initiative)-compliant and nonfree parts.

Why use the OSI rather than the FSF definition of free?

Most major software and data distributors (Linux distros, conda-forge, Homebrew, etc.) use the OSI definition as their primary standard. The FSF definition matters to the free software movement and to documentation and content projects (e.g., GNU, Wikimedia), but it is not the baseline for most mainstream software and data distribution channels.

Two markdown files are introduced:

  • free_packages_osi.md: Packages with OSI-approved, public domain, or similarly permissive licenses.
  • nonfree_packages_osi.md: Packages with more restrictive, ambiguous, or otherwise non-OSI-compliant licenses.

Every effort has been made to classify each package based on available license information, but feedback and corrections are very welcome—especially for any unclear or disputed cases.

Discussion is welcome and encouraged! If you spot anything that should be reviewed or improved, please join the conversation.

ekaf • Jun 27 '25 08:06

The proposed list of free licenses should probably be wider than just the OSI-approved software licenses.

Here's why:

  • OSI focuses on Software: The OSI defines "open source" specifically for software.
  • Data has other "free" licenses: Many licenses are equally permissive and FOSS-compatible for data, content, or standards, even if not OSI-approved. Examples include:
    • Public Domain (e.g., CC0)
    • Permissive Creative Commons (e.g., CC BY)
    • Specific standards licenses (e.g., Unicode Terms of Use, IETF Trust License, W3C Document License)

These licenses grant essential freedoms (use, modify, redistribute, including commercially) for data.

Crucially, this broader definition of "free" still firmly excludes:

  • Non-Commercial (NC) or No Derivatives (ND) licenses.
  • "Academic Use Only" or "Research Use Only" restrictions.
  • Ambiguous or truly "unknown" licenses (like Punkt's).
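
The inclusion and exclusion rules above could be sketched as a small keyword-based classifier. This is only an illustration, assuming license statements arrive as free text: the function name `classify_license` and both pattern lists are hypothetical and far from complete, and any real classification would still need manual review of each license.

```python
import re

# Hypothetical keyword lists for illustration only; not an exhaustive policy.
NONFREE_PATTERNS = [
    r"\bnon-?commercial\b",                # "Non-Commercial", "NonCommercial"
    r"\bno[- ]?derivatives\b",             # "No Derivatives"
    r"\b(academic|research) use only\b",   # use-restricted licenses
    r"\bcc[ -]by(-[a-z]+)*-(nc|nd)\b",     # CC BY-NC, CC BY-NC-SA, CC BY-ND, ...
]
FREE_PATTERNS = [
    r"\bcc0\b",
    r"\bpublic domain\b",
    r"\bcc[ -]by(-sa)?\b",
    r"\b(mit|gpl|apache|bsd)\b",
]


def classify_license(text):
    """Return "free" or "nonfree" for a free-text license statement."""
    normalized = (text or "").strip().lower()
    # Missing or unknown license statements are never treated as free.
    if not normalized or normalized == "unknown":
        return "nonfree"
    # Restrictions are checked first, so "CC BY-NC" is not mistaken for "CC BY".
    if any(re.search(p, normalized) for p in NONFREE_PATTERNS):
        return "nonfree"
    if any(re.search(p, normalized) for p in FREE_PATTERNS):
        return "free"
    # Anything unrecognized stays on the conservative side.
    return "nonfree"
```

Checking restrictions before permissions, and defaulting to "nonfree", mirrors the point that ambiguous or unknown terms are excluded rather than assumed free.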

ekaf • Jun 29 '25 08:06

All packages in nltk_data/index.xml have been audited from a FOSS (Free and Open Source Software) compliance perspective. The resulting categorization is captured in two new files added to this pull request:

  • free_packages_foss.md: This document lists packages with clear, FOSS-compliant licenses (such as MIT, GPL, CC BY) as well as a new "Rescued Packages" section for those that are widely used and assumed to be free despite ambiguous or unstated licensing terms.

  • nonfree_packages_foss.md: This document lists packages that are non-compliant with FOSS principles, either due to explicit restrictions (e.g., non-commercial use only) or highly ambiguous license statements.

These two lists together provide a complete overview of the licensing status for every single package in the NLTK data collection.
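
As a rough illustration of how such an audit could start, the sketch below pulls each package's `id` and `license` attribute out of an index file and flags entries with no license statement. It assumes `<package>` elements that carry `id` and `license` attributes. The sample XML, the package names, and the helper name `list_licenses` are invented for the example, and this is not necessarily how the audit in this PR was carried out.

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for nltk_data/index.xml (assumed layout:
# <package> elements with "id" and "license" attributes).
SAMPLE_INDEX = """\
<nltk_data>
  <packages>
    <package id="example_corpus" license="CC BY 4.0" />
    <package id="example_model" license="" />
    <package id="example_lexicon" license="Research use only" />
  </packages>
</nltk_data>
"""


def list_licenses(index_xml):
    """Map each package id to its stripped license string ('' if absent)."""
    root = ET.fromstring(index_xml)
    return {
        pkg.get("id"): (pkg.get("license") or "").strip()
        for pkg in root.iter("package")
    }


licenses = list_licenses(SAMPLE_INDEX)
# Packages with no license statement at all need manual follow-up.
unstated = sorted(pid for pid, lic in licenses.items() if not lic)
```

For a real checkout, the contents of index.xml would be passed in place of `SAMPLE_INDEX`; the packages collected in `unstated` are the ones most likely to need a closer look.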

ekaf • Aug 03 '25 10:08

Marking this PR as "Ready for Review" to encourage broader feedback and community input.

While I anticipate some modifications may be necessary, the current state provides a solid foundation for discussion and refinement regarding FOSS compliance. All feedback and suggestions are welcome!

ekaf • Aug 03 '25 11:08