Semcor License is likely invalid
Problem Statement
The semcor corpus is currently distributed in nltk_data under the Princeton WordNet License. However, a review of its provenance reveals that this license is likely invalid for distributing the underlying text, making semcor a non-free package.
Reasoning for semcor License Invalidity
- Derivative Work Status: semcor is not a new work; it is the Brown Corpus with added semantic annotations. Under copyright law, this makes it a derivative work of the original Brown Corpus.
- Restrictive Source License: The Brown Corpus is unequivocally licensed under restrictive terms by the Linguistic Data Consortium (LDC). The LDC is the sole official licensing authority.
- No Evidence of Sublicensing: There is no public evidence that Princeton University secured a sublicensing agreement from the LDC that would permit them to strip the LDC's restrictions and re-license the underlying Brown text under a permissive license.
- Princeton is Not an LDC Member: Investigation confirms Princeton University is not a member of the LDC consortium, eliminating the possibility of special institutional rights that could justify this re-licensing.
Conclusion: Therefore, the Princeton WordNet License attached to semcor is an overreach and is almost certainly invalid for its core content. Distributing semcor relies on academic leniency, not a sound legal basis. It must be classified as non-free.
Why This Does NOT Affect wordnet
It is crucial to understand that the invalidity of the semcor license does not "infect" the WordNet database itself. The legal arguments are distinct:
- semcor is a Corpus: It contains the full, expressive text of the copyrighted Brown Corpus. Distributing it copies protected expression.
- WordNet is a Database of Facts: WordNet used semcor to derive sense frequencies, which are statistical facts about language use. Copyright protects expression, not facts. The structure of WordNet is its own creative, copyrightable work, and the facts it contains are unprotectable.
The creation of WordNet's data from semcor is a textbook example of fair use (highly transformative, non-expressive purpose) and falls under the fact/expression dichotomy. The WordNet database remains on solid legal ground under its permissive Princeton WordNet License.
Proposed Action
- Officially reclassify the semcor package from "free" to "non-free" in the nltk_data index and documentation.
- Ensure semcor is included in the nltk-edu (restricted) pip package and excluded from the nltk-free (commercial-safe) package.
- Update the semcor documentation to clearly state: "Distributed under the Princeton WordNet License, but this license is likely invalid as it is a derivative work of the LDC-licensed Brown Corpus. For academic use only."
This action is necessary to maintain the legal integrity of the NLTK project and protect its users.
Seeking Clarification
To resolve this ambiguity, we would welcome clarification from the WordNet team at Princeton University, particularly Dr. Christiane Fellbaum, regarding the rights obtained for creating and distributing semcor as a derivative work of the Brown Corpus. I am also contacting the WordNet team by email, with an invitation to take part in this discussion.
Hi,
I think that, sadly, you are right, and further it cannot legally be redistributed.
As far as I can tell, Brown is licensed by ICAME not LDC, under the CLARIN_ACA license: https://www.kielipankki.fi/wp-content/uploads/CLARIN_ACA_AFFIL-EDU_NORED_en.html
This is according to this page: https://clarino.uib.no/korpuskel/corpora or here: https://icame.info/icame-corpora/
Unfortunately, it does NOT allow redistribution:
NORED: The user is not permitted to redistribute the resource.
So not only is it for academic use only, but we may not redistribute it.
Francis
FWIW, I have emailed ICAME asking if they will allow redistribution of the subset used in SemCor.
For distributing SemCor as non-commercial "for academic use only", NLTK should theoretically be covered by the existing sublicense granted from Brown University to the Princeton University/WordNet project. Here's the reasoning:
- Brown Corpus's Non-Commercial Stance: The Brown Corpus itself has historically been distributed under a license that restricts use to non-commercial academic research and teaching. Brown University's primary licensing bodies (like ICAME) adhere to this.
- Princeton's Sublicense Authority: When Princeton University created SemCor, they had to legally acquire the text from Brown. It is a standard and safe assumption that the agreement between Brown and Princeton granted Princeton the right to use the Brown text for the purpose of creating and distributing SemCor, provided that the resulting SemCor package adheres to the original non-commercial restrictions of the underlying text.
  - Brown to Princeton: Permission to use the text for the derivative work (SemCor).
  - Princeton to Licensee (e.g., NLTK user): Permission to use the full package, with the caveat that the original Brown restrictions apply to the text content.
- NLTK's Role as a Distributor: NLTK acts as a distribution channel for many corpora. As long as NLTK:
  - Does not modify the text itself (other than reformatting it for NLTK's reader).
  - Includes the full, original license notice from Princeton (which includes the compliance statement regarding the Princeton copyright and statements).
  - Clearly indicates that the resource is non-commercial in its own data catalog.
  ...then NLTK is distributing the resource under the terms granted to Princeton, which include non-commercial redistribution rights for academic use.
The Bottom Line
NLTK likely does not need to seek new permission directly from Brown for every SemCor download. The initial, one-time permission for the Brown subset to be included in SemCor was granted to the WordNet/Princeton University project. NLTK is simply re-distributing that derivative work in accordance with the terms of that original permission (which limits the end-user to non-commercial, academic use). What is necessary is not asking for permission, but ensuring the licensing terms are clear to the end-user (as discussed in the previous answer) to prevent unauthorized commercial use of the underlying Brown text.
DeepSeek proposes to add a Legal_Notice file to the semcor package, with the following text:
Legal Notice for the SEMCOR Corpus
This package, semcor, is a derivative work with multiple copyright claims.
- Underlying Text: The textual content is from the Brown Corpus. The Brown Corpus has a complex licensing history. It was distributed by both the Linguistic Data Consortium (LDC) and the International Computer Archive of Modern and Medieval English (ICAME). Evidence indicates that neither distributor permitted free redistribution or the creation of derivative works. The original restrictive terms for the Brown Corpus apply to this underlying text.
- Semantic Annotations: The sense tags, lemmatization, and syntactic structure are the creative work of the Princeton WordNet team, distributed under the Princeton WordNet License (see the LICENSE file in this directory).
Terms of Use
Because this is a derivative work of the restrictively licensed Brown Corpus, the most restrictive terms apply. Therefore, this package is provided for:
Non-Commercial, Academic, and Research Use Only.
Commercial use, redistribution, or the creation of further derivative works is not permitted under these terms.
@fcbond, it seems that the situation is indeed desperate for the brown corpus itself, and that we need to remove it. However, there is probably hope for preserving semcor under a dual, restrictive license.
The Core Legal Dilemma and a Proposed Path Forward
This issue correctly identifies the licensing problem with semcor, but it's critical to recognize that the core issue extends to the brown package itself in NLTK, and the situation is more severe than previously acknowledged.
The Fundamental Problem with LDC Licensing
The NLTK brown package contains the LDC-tagged version of the corpus. The LDC's standard terms for non-members are unambiguous and contain a critical restriction:
"No Distribution. User shall not copy, download, redistribute, transfer, sell, rent, lease, sublicense or otherwise transfer the LDC Data to any person or entity."
This creates an inescapable conclusion:
- brown: The current distribution violates the LDC's core "No Distribution" clause. Including it in any package, even non-free, constitutes infringement.
- semcor: As a derivative work, it faces the same fundamental distribution problem.
The "Transformative Work" Argument for semcor
However, there is a crucial legal distinction that might preserve semcor:
- semcor is not just the Brown Corpus. It is the Brown Corpus plus a massive, original, creative layer of semantic annotation (WordNet sense tags).
- This annotation represents significant scholarly work that transforms the raw text into a new resource for computational lexical semantics.
- Under "fair use" doctrine, creating and using such a transformative work for research and education could be defensible.
Recommended Three-Tiered Solution
Given this analysis, the most legally coherent and pragmatic path is:
- nltk-free Package:
  - Contains only permissively licensed data.
  - Excludes both brown and semcor.
- nltk-edu Package:
  - Contains semcor with a strong, honest disclaimer:
    "This corpus is a transformative scholarly work comprising the Brown Corpus with original semantic annotations by Princeton. It is distributed for non-commercial research and educational use under a claim of fair use. The underlying Brown Corpus text is subject to an LDC 'No Distribution' clause; commercial users must obtain a license from the LDC."
- Remove the raw brown corpus:
  - Its case for fair use is much weaker than semcor's.
  - It represents a straightforward violation of the LDC's "No Distribution" clause.
This approach removes the most blatant infringement (brown) while providing a credible legal rationale for maintaining access to the educationally valuable semcor for academic use. It replaces legal ambiguity with transparent, risk-aware packaging.
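The tiering proposed above can be sketched in a few lines of Python. This is a hypothetical illustration only, not NLTK's actual index format or packaging logic: the PACKAGES table, the status labels, and the split_tiers helper are all assumptions made up for this example, and the sketch assumes the two tiers are disjoint. Only the package names (wordnet, semcor, brown) and the nltk-free / nltk-edu split come from this thread.

```python
from typing import Dict, List

# Assumed license status per package, following the proposal in this thread.
# These labels are hypothetical; nltk_data does not define them.
PACKAGES: Dict[str, str] = {
    "wordnet": "permissive",     # Princeton WordNet License
    "semcor": "academic-only",   # derivative of the Brown Corpus
    "brown": "remove",           # proposed for removal altogether
}

# Hypothetical mapping from status label to the proposed pip package.
TIER_FOR_STATUS: Dict[str, str] = {
    "permissive": "nltk-free",
    "academic-only": "nltk-edu",
}


def split_tiers(packages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group package names by proposed tier, dropping anything marked for removal."""
    tiers: Dict[str, List[str]] = {"nltk-free": [], "nltk-edu": []}
    for name, status in sorted(packages.items()):
        tier = TIER_FOR_STATUS.get(status)
        if tier is not None:  # statuses without a tier (e.g. "remove") are excluded
            tiers[tier].append(name)
    return tiers


if __name__ == "__main__":
    for tier, names in split_tiers(PACKAGES).items():
        print(tier, names)
```

Keeping the status label separate from the tier name means a later decision (for example, reclassifying brown as academic-only instead of removing it) would only require changing one entry in the table.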
Hi everyone,
--by way of background: We got permission to use the Brown Corpus (decades ago) thanks to friendly personal relations with the creators; they couldn't foresee the developments in NLP at the time and were just researchers not interested in making extra cash.
--small correction: To the best of my knowledge, Princeton is a paying member of the LDC (will ascertain with the librarian in charge later).
--as to free/non-free: I'll see whether the Princeton lawyers have an opinion, but they usually regard WordNet as too insignificant to pay attention to. Will follow up, but default may be "non-free."
Christiane
Thanks @ChristianeFellbaum, it is positive to learn that you had direct permission from the Brown Corpus compilers, because that could allow bypassing the LDC, especially if your agreement pre-dates Brown's deal with the LDC. Other factors that could help are whether your agreement was formalized in writing, and whether you used a pre-release version of Brown that could differ from the final version. The question is whether Princeton has the right to distribute semcor at all from the Princeton University servers, and under which terms. This question should worry the Princeton lawyers. Then, NLTK would just need to rely on the decision made at Princeton.
Regarding Brown's rights to distribute the individual text samples, the nature of the Brown Corpus—2,000-word fragments taken from a random starting point in larger texts—is a classic example of a use that is highly likely to be protected by fair use, especially in the context of its original purpose as a non-commercial, scholarly research corpus.
Hi Eric and Francis,
Princeton doesn’t distribute SemCor on the website. I don’t see it anywhere in the downloadable files that you can get to via the links on the website, and a search of the website for “semcor” doesn’t return anything. I don’t even see a link on the “Related Projects” page to nltk. I only found this, which isn’t SemCor:
Python
• Natural Language Toolkit has taken over the development of pywordnet. There is now a Python package, nltk_lite.wordnet, which incorporates pywordnet and which supports WordNet 2.1. It is included in NLTK Lite.
I also searched the site for the word “corpus” and it comes up in a variety of pages, but none mention SemCor.
There is a copy of “SemCor” on a Princeton server that also houses other downloadable files related to the WordNet project. The downloadable files linked to from the WordNet website are moving to the Princeton Data Commons data sharing platform in the near term. The Semcor files are not a part of this migration, and the Princeton server will be retired after the pertinent files are migrated.
Randee & Christiane
Thanks Christiane and Randee, it is sad to hear that you will be discontinuing the distribution of semcor, presumably due to legal concerns. Maybe we should read this as an indication that you only had an oral agreement with Profs. Francis and Kucera.
Actually, the version of semcor distributed through nltk_data is not from Princeton: it was produced by Rada Mihalcea, who mapped your original semcor1.6 to WordNet 3.0. She also mapped your work to the other WordNet versions between 1.6 and 3.0, and still distributes these packages through her download page at the University of Michigan. However, since she just refers to the Princeton license, this changes nothing about these packages' legal status, which still appears unclear.
Many scientific articles rely on the nltk corpus reader for the reproducibility of findings in the Brown Corpus, and removing it from nltk_data would disrupt the reproducibility of these studies. While we continue to seek clarification of the applicable licensing terms, the following could be used as a concise and truthful middle-ground:
- brown: The Brown Corpus. An LDC-licensed corpus. Provided for non-commercial, academic research use only.
- semcor: A semantically tagged corpus derived from the Brown Corpus. Provided for non-commercial, academic use only.
Hi Eric,
Certainly, the data should be accessible freely to users. But just to reiterate: Princeton is NOT distributing Semcor, so while we created it, we are not responsible for its dissemination. You may want to talk to Rada Mihalcea at University of Michigan for more clarification.
All best, Christiane
(Reply by email to the notification of Eric Kafe's comment on nltk/nltk_data issue #250, "Semcor License is likely invalid", sent Wednesday, October 15, 2025, 3:17 AM.)
Thanks Christiane, Rada answered that she would like to be able to provide some clarification, but unfortunately doesn't think she has additional information.
Now, concerning the ongoing effort to clarify nltk_data licenses since PR #242, it looks like there could be a consensus here for modifying PR #247 by removing Semcor from the free collection and classifying it as nonfree instead.