Different API calls for obtaining all lemma names in NLTK's Open Multilingual Wordnet produce inconsistent results
This bug appears to be related to #42, but is of a more general character.
import nltk
from tabulate import tabulate

# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

table = list()
for lang in sorted(wn.langs()):
    my_set_of_all_lemma_names = set()
    from nltk.corpus import wordnet as wn
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                my_set_of_all_lemma_names.add(lemma)
    table.append([lang,
                  len(set(wn.all_lemma_names(lang=lang))),
                  len(my_set_of_all_lemma_names)])

print(tabulate(table,
               headers=["Language code",
                        "all_lemma_names()",
                        "lemma_name.synset.lemma.lemma_names()"]))
produces (with headers condensed onto multiple lines, and column markers added):
Language | all_lemma_names() | lemma_name.synset
code | | .lemma.lemma_names()
-------- | ----------------- | --------------------
als | 5988 | 2477
arb | 17785 | 54
bul | 6720 | 0
cat | 46534 | 24368
cmn | 61532 | 13
dan | 4468 | 4336
ell | 18229 | 800
eng | 147306 | 148730
eus | 26242 | 6055
fas | 17560 | 0
fin | 129839 | 49042
fra | 55350 | 45367
glg | 23125 | 12893
heb | 5325 | 0
hrv | 29010 | 8596
ind | 36954 | 21780
ita | 41855 | 13225
jpn | 89637 | 1028
nno | 3387 | 3255
nob | 4186 | 3678
pol | 45387 | 10844
por | 54069 | 21889
qcn | 3206 | 0
slv | 40236 | 25363
spa | 36681 | 20922
swe | 5824 | 4640
tha | 80508 | 622
zsm | 33932 | 19253
As with #42, it is interesting that sometimes the first API call finds more lemma names and sometimes the second does. That again suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs) rather than something intentional.
@fcbond: are you aware of this issue?
Yes. I have just started sabbatical (at UW), so I hope to be working on these issues soon.
This is not a problem with the data. The error is in the way @sampablokuper retrieved synsets for the lemmas, though I sympathize: the API for retrieving non-English entries can be confusing at times. Without a lang argument, wn.synsets(term) looks the term up in the English wordnet, so most non-English lemma names match nothing (hence the zero counts for bul, fas, heb, and qcn above) or match an English word only by coincidence.
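As a minimal illustration (a sketch; it assumes the lemma 'chien' is present in the French OMW data):

from nltk.corpus import wordnet as wn

# An English-only lookup finds nothing for a French lemma name:
wn.synsets('chien')                        # expected: []

# Specifying the language resolves the lemma against the right wordnet:
wn.synsets('chien', lang='fra')            # expected to include Synset('dog.n.01')

# lemma_names() likewise needs to be told which language to report:
wn.synset('dog.n.01').lemma_names('fra')   # expected to include 'chien'

If we specify the language when requesting synsets and lemmas, as in the following diff against the original loop: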
for lang in sorted(wn.langs()):
    my_set_of_all_lemma_names = set()
-   from nltk.corpus import wordnet as wn
    for aln_term in list(wn.all_lemma_names(lang=lang)):
-       for synset in wn.synsets(aln_term):
+       for synset in wn.synsets(aln_term, lang=lang):
-           for lemma in synset.lemma_names():
+           for lemma in synset.lemma_names(lang=lang):
                my_set_of_all_lemma_names.add(lemma)
...then we get the expected counts:
Language code all_lemma_names() lemma_name.synset.lemma.lemma_names()
--------------- ------------------- ---------------------------------------
als 5988 5988
arb 17785 17785
bul 6720 6720
cat 46531 46531
cmn 61532 61523
dan 4468 4468
ell 18225 18225
eng 147306 148730
eus 26240 26240
fas 17560 17560
fin 129839 129839
fra 55350 55350
glg 23124 23123
heb 5325 5325
hrv 29010 28996
ind 36954 36954
ita 41855 41855
jpn 89637 89637
nld 43077 43077
nno 3387 3387
nob 4186 4186
pol 45387 45387
por 54069 54069
qcn 3206 3206
slv 41032 41032
spa 36681 36681
swe 5824 5824
tha 80508 80508
zsm 33932 33932
There are a few residual discrepancies because there are bugs in NLTK itself. To produce the table above I had to catch ValueError, WordNetError, StopIteration, and KeyError around the wn.synsets() call. English is the only language for which the loop method finds more lemmas than all_lemma_names(); it is also the only language with morphy available for morphological query expansion, and I suspect these two facts are related, though I have not confirmed it.
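For reference, here is a sketch of the guarded loop (the exact placement of the try/except is my reconstruction; WordNetError is importable from nltk.corpus.reader.wordnet):

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetError

counts = []
for lang in sorted(wn.langs()):
    names = set()
    for term in wn.all_lemma_names(lang=lang):
        try:
            synsets = wn.synsets(term, lang=lang)
        except (ValueError, WordNetError, StopIteration, KeyError):
            continue  # skip entries that trip bugs inside NLTK
        for synset in synsets:
            for lemma in synset.lemma_names(lang=lang):
                names.add(lemma)
    counts.append([lang, len(set(wn.all_lemma_names(lang=lang))), len(names)])

(On the morphy point: wn.morphy('dogs') returns 'dog', and wn.synsets() applies this normalization for English only, so an English lookup can reach synsets through inflected forms in a way no other language's lookup can.)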
In any case I think this issue can be closed.