Capitalization inconsistencies in NLTK's Open Multilingual Wordnet
NLTK's Open Multilingual Wordnet ("OMW") corpus data violates the principle of least surprise in the following respect.
A user can reasonably expect that any NLTK function that:
- produces a list of all lemmas for a given language, and
- retains, in that list, the language's typical capitalisation of those lemmas
will behave the same way for every other language in OMW (at least for every language that uses capital letters).
However, NLTK violates that expectation:
```python
import nltk
import re

# Matches lemma names whose first character is an ASCII capital letter.
capital = re.compile(r'[A-Z]')

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

# For each language, count the lemma names beginning with A-Z.
for lang in sorted(wn.langs()):
    print lang, len(filter(capital.match, wn.all_lemma_names(lang=lang)))
```
produces:
```
als 14
arb 6
bul 1
cat 4139
cmn 55
dan 3
ell 314
eng 0
eus 722
fas 0
fin 30861
fra 11818
glg 4272
heb 2
hrv 2505
ind 15072
ita 2416
jpn 168
nno 1
nob 3
pol 2492
por 21850
qcn 0
slv 2888
spa 6493
swe 5
tha 0
zsm 11193
```
Some of the zero or very low results above are probably due to the language in question not generally using the letters A-Z. However, not all of the low results can be accounted for in this way, and those that cannot represent inconsistencies in OMW.
I have not yet investigated which kinds of inconsistency they represent; there may well be more than one kind.
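One quick way to probe the zero rows is to sample a few lemma names for those languages and see which script they use (a sketch in the same style as the program above; the language codes are just the zero rows from the table):

```python
from nltk.corpus import wordnet as wn

# Peek at a few lemma names for the languages that reported zero,
# to see whether their lemmas use the letters A-Z at all.
for lang in ['eng', 'fas', 'qcn', 'tha']:
    print lang, list(wn.all_lemma_names(lang=lang))[:5]
```

For fas and tha one would expect non-Latin scripts, which would explain their zeros; a zero for eng, whose lemmas are certainly written in Latin script, cannot be explained this way.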
In the case of English (eng), the zero result is due in part to a type inconsistency. See issue #43.
However, this rewrite of the Python code in the original ticket, which includes a workaround for that issue and also prints type information, reveals that #43 does not entirely account for the zero result for English:
```python
import nltk
import re

capital = re.compile(r'[A-Z]')

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

for lang in sorted(wn.langs()):
    # Wrapping in list() is the workaround for issue #43.
    names = list(wn.all_lemma_names(lang=lang))
    print lang, type(names), len(filter(capital.match, names))
```
Result:
```
als <type 'list'> 14
arb <type 'list'> 6
bul <type 'list'> 1
cat <type 'list'> 4139
cmn <type 'list'> 55
dan <type 'list'> 3
ell <type 'list'> 314
eng <type 'list'> 0
eus <type 'list'> 722
fas <type 'list'> 0
fin <type 'list'> 30861
fra <type 'list'> 11818
glg <type 'list'> 4272
heb <type 'list'> 2
hrv <type 'list'> 2505
ind <type 'list'> 15072
ita <type 'list'> 2416
jpn <type 'list'> 168
nno <type 'list'> 1
nob <type 'list'> 3
pol <type 'list'> 2492
por <type 'list'> 21850
qcn <type 'list'> 0
slv <type 'list'> 2888
spa <type 'list'> 6493
swe <type 'list'> 5
tha <type 'list'> 0
zsm <type 'list'> 11193
```
Tagging @fcbond as the OMW maintainer.
The following short program perhaps demonstrates the inconsistency more clearly, by obtaining lemma names via two different API calls:
```python
import nltk
import re
from tabulate import tabulate

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

capital = re.compile(r'[A-Z]')
table = list()
for lang in sorted(wn.langs()):
    # Second route: look each lemma name up again via synsets() and
    # collect the lemma names of the synsets found. (Note that
    # synsets() and lemma_names() are called here without lang=.)
    lemma_names_with_capitals_retained = set()
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                lemma_names_with_capitals_retained.add(lemma)
    table.append([lang,
                  len(filter(capital.match, set(wn.all_lemma_names(lang=lang)))),
                  len(filter(capital.match, lemma_names_with_capitals_retained))])

print tabulate(table,
               headers=["Language code",
                        "all_lemma_names(): # w/capitals",
                        "lemma_name.synset.lemma.lemma_names(): # w/capitals"])
```
The output (with headers condensed onto multiple lines, and column markers added):
```
Language | all_lemma_names(): | lemma_name.synset
code     | # w/capitals       | .lemma.lemma_names():
         |                    | # w/capitals
---------|--------------------|-----------------------
als      | 14                 | 191
arb      | 6                  | 10
bul      | 1                  | 0
cat      | 4139               | 10451
cmn      | 55                 | 2
dan      | 3                  | 257
ell      | 314                | 398
eng      | 0                  | 34477
eus      | 722                | 1321
fas      | 0                  | 0
fin      | 30861              | 25876
fra      | 11818              | 17412
glg      | 4272               | 7109
heb      | 2                  | 0
hrv      | 2505               | 3034
ind      | 15072              | 13361
ita      | 2416               | 3632
jpn      | 168                | 346
nno      | 1                  | 207
nob      | 3                  | 261
pol      | 2492               | 2502
por      | 21850              | 10465
qcn      | 0                  | 0
slv      | 2888               | 11062
spa      | 6493               | 9962
swe      | 5                  | 320
tha      | 0                  | 68
zsm      | 11193              | 11908
```
It is clear from these data that different ways of fetching lemma names from OMW via NLTK yield very different results, at least in terms of capitalisation.
In particular, it is interesting that sometimes the first API call finds more lemma names beginning with capital letters, and sometimes the second does. That suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs) and is not intentional.
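For English in particular, the discrepancy can be seen with a single entry. The following sketch contrasts the two routes (Bob_Dylan is merely an example entry, and the expected results in the comments are assumptions based on the zero count for eng above):

```python
from nltk.corpus import wordnet as wn

names = set(wn.all_lemma_names())
# First route: all_lemma_names() appears to hold only lower-cased names.
print 'bob_dylan' in names    # True
print 'Bob_Dylan' in names    # False, if the name list is indeed fully lower-cased
# Second route: lemma_names() on a synset preserves the original case.
print wn.synsets('bob_dylan')[0].lemma_names()    # ['Dylan', 'Bob_Dylan']
```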
Thanks. I will try to look at this as soon as I can find some time. It looks like a bug, sadly.
@fcbond thanks for this. Will the GitHub interface let you mark yourself as the "Assignee" for the issue? If not, then perhaps @stevenbird will be able to.
@stevenbird can you either assign this to me or add me as a collaborator and I will assign it myself?
@stevenbird gentle bump.
@fcbond, @sampablokuper – done, sorry for the delay; @fcbond please assign this to yourself once you've accepted the invitation
I'm not sure I see the problem here. The code counts how many of the entries begin with an ASCII capital letter, and different languages naturally have different numbers of such words. In the case of English, which shows zero, I think that #43 might indeed play a role, because obviously there are some upper-case entries:
```python
>>> wn.synsets('Bob_Dylan')[0].lemma_names()
['Dylan', 'Bob_Dylan']
```
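One quick check (a sketch continuing the same session; the False result is what I would expect if the English name list is fully lower-cased, as the zero count suggests) is whether all_lemma_names() for English ever yields a name that differs from its lower-cased form:

```python
>>> any(name != name.lower() for name in wn.all_lemma_names())
False
```

As I understand it, the Princeton WordNet index files store lemmas in lower case, which would explain a zero count for eng regardless of #43.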
But what should we expect for the other languages? It's not even the case that each language has the same number of lemmas, so I don't know why we'd expect them to have the same number of lemmas starting with upper-case ASCII letters (if that is in fact what was expected). Also there is some normalization that occurs during lookup, so you might not get back the same case as was queried:
```python
>>> wn.synsets('bob_dylan')[0].lemma_names()
['Dylan', 'Bob_Dylan']
>>> wn.synsets('Δημοκρατία', lang='ell')[0].lemma_names(lang='ell')
['δημοκρατία']
```
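As far as I can tell, that normalization happens because lookup lower-cases the query before consulting the index, so differently-cased queries resolve to the same synsets. A minimal check (the True output is my expectation under that assumption):

```python
>>> wn.synsets('BOB_DYLAN') == wn.synsets('bob_dylan')
True
```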
Below I print out the language code, the count of all lemmas, the count of lemmas starting with upper-case (using str.isupper() instead of a regex match to make it unicode-capable), and a sample of up to 5 entries:
```python
>>> for lang in sorted(wn.langs()):
...     names = set(wn.all_lemma_names(lang=lang))
...     uppers = [name for name in wn.all_lemma_names(lang=lang) if name[0].isupper()]
...     print(f'{lang} {len(names):>8} {len(uppers):>6} {uppers[:5]}')
...
als     5988     14 ['John_Barleycorn', 'Lëndim', 'Pu', 'P', 'Lojë_me_pasa']
arb    17785      6 ['NSAID', 'I', 'Pb', 'S', 'HN']
bul     6720     10 ['Западът', ''', 'Библия', 'Южни_Щати', 'Коран']
cat    46531   4156 ['San_Angelo', 'Saint_Lucia', 'Boris_Karloff', 'Bob_Woodward', 'Rudyard_Kipling']
cmn    61532     59 ['B电池', 'T型发动机小汽车', 'T形台', 'CD', 'C电池']
dan     4468      3 ['T-shirt', 'Guds_hus', 'TOP']
ell    18225   1459 ['Κίνγκστον', 'Ουζμπεκιστάν', 'Αγκόλα', 'Νότιες_Αμερικανικές_χώρες', 'Δημοκρατία_της_Βολιβίας']
eng   147306      0 []
eus    26240    722 ['Abruzzi', 'IRA', 'Demostenes', 'Errumania', 'Job']
fas    17560      0 []
fin   129839  30876 ['San_Angelo', 'Saint_Lucia', 'Bromus_arvensis', 'Coluber-suku', 'Rasht']
fra    55350  12056 ['Apogonidae', 'Lydia_Kamakaeha', 'Central_Intelligence_Agency', 'Sous-unités_du_degré', "Liste_d'historiens_de_la_Bretagne"]
glg    23124   4278 ['San_Angelo', 'Boris_Karloff', 'Bob_Woodward', 'IRA', 'Rudyard_Kipling']
heb     5325      2 ['GAP!', 'PSEUDOGAP!']
hrv    29010   2545 ['Amazonke', 'Rom', 'Logički_sklop_NI', 'IRA', 'Phyllitis_scolopendrium']
ind    36954  15072 ['San_Angelo', 'Saint_Lucia', 'Bahasa_Mandarin', 'Bangsa_Keltik', 'Abruzzi']
ita    41855   2416 ['Bromus_arvensis', 'IRA', 'Phyllitis_scolopendrium', 'K', 'Phalaropus_fulicarius']
jpn    89637    238 ['K', 'CoA', 'JAVA', 'Eメイル+する', 'AB型']
nld    43077   3533 ['Saint_Lucia', 'Noord-Afrika', 'Romeinse_Rijk', 'Latijn', 'Spaanse_ras_van_kleine_paarden']
nno     3387      1 ['Internett']
nob     4186      3 ['T-skjorte', 'Guds_hus', 'Internett']
pol    45387   2512 ['Układ_Słoneczny', 'Afryka_Wschodnia', 'Rom', 'Stworzyciel', 'Cycero']
por    54069  22127 ['San_Angelo', 'Cordilheira_Australiana', 'Acido_nitrico', 'Apogonidae', 'Casa_de_iorque']
qcn     3206      0 []
slv    41032   3123 ['Beli_Nil', 'Eolija', 'Rudyard_Kipling', 'Phyllitis_scolopendrium', 'K']
spa    36681   6495 ['San_Angelo', 'Bairdiella', 'Boris_Karloff', 'Bob_Woodward', 'Rudyard_Kipling']
swe     5824      5 ['Venus', 'PIN-kod', 'Europa', 'TV', 'Mars']
tha    80508      0 []
zsm    33932  11193 ['San_Angelo', 'Saint_Lucia', 'Bahasa_Mandarin', 'Abruzzi', 'Rom']
```
Except perhaps for the English case, I'm not sure what the issue is here, and I suggest we close it.
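If the raw counts are the concern, a fairer cross-language comparison might be the share of capitalised lemmas rather than the absolute number, since the lemma totals above differ by an order of magnitude between languages. A minimal sketch building on the loop above (output omitted):

```python
>>> for lang in sorted(wn.langs()):
...     names = set(wn.all_lemma_names(lang=lang))
...     upper = sum(1 for name in names if name[:1].isupper())
...     print(f'{lang} {upper / len(names):.2%}')
```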