
Different API calls for obtaining all lemma names in NLTK's Open Multilingual Wordnet produce inconsistent results

ghost opened this issue on Dec 17, 2015 · 3 comments

This bug appears to be related to #42, but is of a more general character.

import nltk
from tabulate import tabulate
# Install Open Multilingual Wordnet and Wordnet
# if not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet','omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

table = list()

for lang in sorted(wn.langs()):
    my_set_of_all_lemma_names = set()
    from nltk.corpus import wordnet as wn
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                my_set_of_all_lemma_names.add(lemma)
    table.append([lang,
        len(set(wn.all_lemma_names(lang=lang))),
        len(my_set_of_all_lemma_names)])

print(tabulate(table,
    headers=["Language code",
        "all_lemma_names()",
        "lemma_name.synset.lemma.lemma_names()"]))

produces (with headers condensed onto multiple lines, and column markers added):

Language | all_lemma_names() | lemma_name.synset
code     |                   | .lemma.lemma_names()
-------- | ----------------- | --------------------
als      |              5988 |                 2477
arb      |             17785 |                   54
bul      |              6720 |                    0
cat      |             46534 |                24368
cmn      |             61532 |                   13
dan      |              4468 |                 4336
ell      |             18229 |                  800
eng      |            147306 |               148730
eus      |             26242 |                 6055
fas      |             17560 |                    0
fin      |            129839 |                49042
fra      |             55350 |                45367
glg      |             23125 |                12893
heb      |              5325 |                    0
hrv      |             29010 |                 8596
ind      |             36954 |                21780
ita      |             41855 |                13225
jpn      |             89637 |                 1028
nno      |              3387 |                 3255
nob      |              4186 |                 3678
pol      |             45387 |                10844
por      |             54069 |                21889
qcn      |              3206 |                    0
slv      |             40236 |                25363
spa      |             36681 |                20922
swe      |              5824 |                 4640
tha      |             80508 |                  622
zsm      |             33932 |                19253

As with #42, it is interesting that sometimes the first API call finds more lemma names and sometimes the second one does. That again suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs) rather than intentional behaviour.
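For reference, a quick way to spot-check the asymmetry on a single term is the small sketch below (not part of the original report; it assumes the wordnet and omw corpora are already installed, and picks an arbitrary Japanese lemma purely for illustration):

from nltk.corpus import wordnet as wn

# Pick an arbitrary lemma name from one of the non-English wordnets.
term = next(iter(wn.all_lemma_names(lang='jpn')))
# Without lang=, the lookup is made against the English lemma names,
# so a non-English term will usually match few or no synsets.
print(wn.synsets(term))
# With lang=, the lookup is made against that language's lemma names.
print(wn.synsets(term, lang='jpn'))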

ghost avatar Dec 17 '15 19:12 ghost

@fcbond: are you aware of this issue?

stevenbird avatar Aug 20 '16 20:08 stevenbird

Yes. I have just started my sabbatical (at UW), so I hope to be working on these issues soon.


fcbond avatar Aug 20 '16 23:08 fcbond

This is not a problem. The error is in the way @sampablokuper retrieved synsets for the lemmas (though I sympathize: the API for retrieving non-English entries can be confusing at times). If we specify the language when requesting synsets and lemma names, as follows:

 for lang in sorted(wn.langs()):
     my_set_of_all_lemma_names = set()
-    from nltk.corpus import wordnet as wn
     for aln_term in list(wn.all_lemma_names(lang=lang)):
-        for synset in wn.synsets(aln_term):
+        for synset in wn.synsets(aln_term, lang=lang):
-            for lemma in synset.lemma_names():
+            for lemma in synset.lemma_names(lang=lang):
                my_set_of_all_lemma_names.add(lemma)

...then we get the expected counts:

Language code      all_lemma_names()    lemma_name.synset.lemma.lemma_names()
---------------  -------------------  ---------------------------------------
als                             5988                                     5988
arb                            17785                                    17785
bul                             6720                                     6720
cat                            46531                                    46531
cmn                            61532                                    61523
dan                             4468                                     4468
ell                            18225                                    18225
eng                           147306                                   148730
eus                            26240                                    26240
fas                            17560                                    17560
fin                           129839                                   129839
fra                            55350                                    55350
glg                            23124                                    23123
heb                             5325                                     5325
hrv                            29010                                    28996
ind                            36954                                    36954
ita                            41855                                    41855
jpn                            89637                                    89637
nld                            43077                                    43077
nno                             3387                                     3387
nob                             4186                                     4186
pol                            45387                                    45387
por                            54069                                    54069
qcn                             3206                                     3206
slv                            41032                                    41032
spa                            36681                                    36681
swe                             5824                                     5824
tha                            80508                                    80508
zsm                            33932                                    33932

There are a few remaining discrepancies, which are due to bugs in NLTK itself. To produce the table above, I had to catch ValueError, WordNetError, StopIteration, and KeyError around the wn.synsets() call. English is the only language for which the loop method finds more lemmas than all_lemma_names(); it is also the only language with morphy for morphological query expansion, and I suspect these two facts are related, but I have not confirmed it. A sketch of the counting loop with that exception handling is shown below.
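For concreteness, here is a minimal, self-contained sketch of the corrected counting loop with the lang= arguments and the exception handling described above. The helper name count_lemmas_via_synsets is just for illustration, and the exact set of exceptions that need catching may vary across NLTK versions, so treat this as illustrative rather than definitive:

from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import WordNetError

def count_lemmas_via_synsets(lang):
    # Count the distinct lemma names reachable through synsets for one language.
    names = set()
    for term in wn.all_lemma_names(lang=lang):
        try:
            synsets = wn.synsets(term, lang=lang)
        except (ValueError, WordNetError, StopIteration, KeyError):
            continue  # skip entries that trigger known NLTK bugs
        for synset in synsets:
            names.update(synset.lemma_names(lang=lang))
    return len(names)

for lang in sorted(wn.langs()):
    print(lang,
        len(set(wn.all_lemma_names(lang=lang))),
        count_lemmas_via_synsets(lang))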

In any case I think this issue can be closed.

goodmami avatar Oct 14 '20 07:10 goodmami