Capitalization inconsistencies in NLTK's Open Multilingual Wordnet
NLTK's Open Multilingual Wordnet ("OMW") corpus data violates the principle of least surprise in the following respect.
A user can reasonably expect that any NLTK function that:
- produces a list of all lemmas for a given language, and
- retains, in that list, the language's typical capitalisation of those lemmas
will behave the same way for every other language in OMW (at least for every language that uses capital letters).
However, NLTK violates that expectation:
```python
import nltk
import re

# Matches lemma names whose first character is an ASCII capital letter.
capital = re.compile(r'[A-Z]')

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

# For each language, count the lemma names beginning with A-Z.
for lang in sorted(wn.langs()):
    print lang, len(filter(capital.match, wn.all_lemma_names(lang=lang)))
```
produces:
```
als 14
arb 6
bul 1
cat 4139
cmn 55
dan 3
ell 314
eng 0
eus 722
fas 0
fin 30861
fra 11818
glg 4272
heb 2
hrv 2505
ind 15072
ita 2416
jpn 168
nno 1
nob 3
pol 2492
por 21850
qcn 0
slv 2888
spa 6493
swe 5
tha 0
zsm 11193
```
Some of the zero or very low results above are probably due to the language in question not generally using the letters A-Z. However, not all of the low results can be accounted for in this way, and those that cannot represent inconsistencies in OMW.
I have not yet investigated which kinds of inconsistency they represent; there may well be more than one kind.
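One quick way to probe the zero rows is to sample a few lemma names for those languages and see which script they use (a sketch in the same style as the program above; the language codes are just the zero rows from the table):

```python
from nltk.corpus import wordnet as wn

# Peek at a few lemma names for the languages that reported zero,
# to see whether their lemmas use the letters A-Z at all.
for lang in ['eng', 'fas', 'qcn', 'tha']:
    print lang, list(wn.all_lemma_names(lang=lang))[:5]
```

For fas and tha one would expect non-Latin scripts, which would explain their zeros; a zero for eng, whose lemmas are certainly written in Latin script, cannot be explained this way.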
In the case of English (eng), the zero result is due in part to a type inconsistency. See issue #43.
However, this rewrite of the Python code in the original ticket, which includes a workaround for that issue and also prints type information, reveals that #43 does not entirely account for the zero result for English:
```python
import nltk
import re

capital = re.compile(r'[A-Z]')

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

for lang in sorted(wn.langs()):
    # Wrapping in list() is the workaround for issue #43.
    names = list(wn.all_lemma_names(lang=lang))
    print lang, type(names), len(filter(capital.match, names))
```
Result:
```
als <type 'list'> 14
arb <type 'list'> 6
bul <type 'list'> 1
cat <type 'list'> 4139
cmn <type 'list'> 55
dan <type 'list'> 3
ell <type 'list'> 314
eng <type 'list'> 0
eus <type 'list'> 722
fas <type 'list'> 0
fin <type 'list'> 30861
fra <type 'list'> 11818
glg <type 'list'> 4272
heb <type 'list'> 2
hrv <type 'list'> 2505
ind <type 'list'> 15072
ita <type 'list'> 2416
jpn <type 'list'> 168
nno <type 'list'> 1
nob <type 'list'> 3
pol <type 'list'> 2492
por <type 'list'> 21850
qcn <type 'list'> 0
slv <type 'list'> 2888
spa <type 'list'> 6493
swe <type 'list'> 5
tha <type 'list'> 0
zsm <type 'list'> 11193
```
Tagging @fcbond as the OMW maintainer.
The following short program perhaps demonstrates the inconsistency more clearly, by obtaining lemma names via two different API calls:
```python
import nltk
import re
from tabulate import tabulate

# Install WordNet and the Open Multilingual Wordnet
# if they are not already installed.
nltkd = nltk.downloader.Downloader()
for corpus in ['wordnet', 'omw']:
    if not nltkd.is_installed(corpus):
        nltk.download(corpus)

from nltk.corpus import wordnet as wn

capital = re.compile(r'[A-Z]')
table = list()
for lang in sorted(wn.langs()):
    # Second route: look each lemma name up again via synsets() and
    # collect the lemma names of the synsets found. (Note that
    # synsets() and lemma_names() are called here without lang=.)
    lemma_names_with_capitals_retained = set()
    for aln_term in list(wn.all_lemma_names(lang=lang)):
        for synset in wn.synsets(aln_term):
            for lemma in synset.lemma_names():
                lemma_names_with_capitals_retained.add(lemma)
    table.append([lang,
                  len(filter(capital.match, set(wn.all_lemma_names(lang=lang)))),
                  len(filter(capital.match, lemma_names_with_capitals_retained))])

print tabulate(table,
               headers=["Language code",
                        "all_lemma_names(): # w/capitals",
                        "lemma_name.synset.lemma.lemma_names(): # w/capitals"])
```
The output (with headers condensed onto multiple lines, and column markers added):
```
Language | all_lemma_names(): | lemma_name.synset
code     | # w/capitals       | .lemma.lemma_names():
         |                    | # w/capitals
---------|--------------------|-----------------------
als      | 14                 | 191
arb      | 6                  | 10
bul      | 1                  | 0
cat      | 4139               | 10451
cmn      | 55                 | 2
dan      | 3                  | 257
ell      | 314                | 398
eng      | 0                  | 34477
eus      | 722                | 1321
fas      | 0                  | 0
fin      | 30861              | 25876
fra      | 11818              | 17412
glg      | 4272               | 7109
heb      | 2                  | 0
hrv      | 2505               | 3034
ind      | 15072              | 13361
ita      | 2416               | 3632
jpn      | 168                | 346
nno      | 1                  | 207
nob      | 3                  | 261
pol      | 2492               | 2502
por      | 21850              | 10465
qcn      | 0                  | 0
slv      | 2888               | 11062
spa      | 6493               | 9962
swe      | 5                  | 320
tha      | 0                  | 68
zsm      | 11193              | 11908
```
It is clear from these data that different ways of fetching lemma names from OMW via NLTK yield very different results, at least in terms of capitalisation.
In particular, it is interesting that sometimes the first API call finds more lemma names beginning with capital letters, and sometimes the second does. That suggests to me that this behaviour does indeed represent a bug (or perhaps a series of bugs) and is not intentional.
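For English in particular, the discrepancy can be seen with a single entry. The following sketch contrasts the two routes (Bob_Dylan is merely an example entry, and the expected results in the comments are assumptions based on the zero count for eng above):

```python
from nltk.corpus import wordnet as wn

names = set(wn.all_lemma_names())
# First route: all_lemma_names() appears to hold only lower-cased names.
print 'bob_dylan' in names    # True
print 'Bob_Dylan' in names    # False, if the name list is indeed fully lower-cased
# Second route: lemma_names() on a synset preserves the original case.
print wn.synsets('bob_dylan')[0].lemma_names()    # ['Dylan', 'Bob_Dylan']
```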
Thanks. I will try to look at this as soon as I can find some time. It looks like a bug, sadly.
@fcbond thanks for this. Will the GitHub interface let you mark yourself as the "Assignee" for the issue? If not, then perhaps @stevenbird will be able to.
@stevenbird can you either assign this to me or add me as a collaborator and I will assign it myself?
@stevenbird gentle bump.
@fcbond, @sampablokuper – done, sorry for the delay; @fcbond please assign this to yourself once you've accepted the invitation
I'm not sure I see the problem here. The code counts how many of the entries begin with an ASCII capital letter, and different languages naturally have different numbers of such words. In the case of English, which shows zero, I think that #43 might indeed play a role, because obviously there are some upper-case entries:
```python
>>> wn.synsets('Bob_Dylan')[0].lemma_names()
['Dylan', 'Bob_Dylan']
```
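One quick check (a sketch continuing the same session; the False result is what I would expect if the English name list is fully lower-cased, as the zero count suggests) is whether all_lemma_names() for English ever yields a name that differs from its lower-cased form:

```python
>>> any(name != name.lower() for name in wn.all_lemma_names())
False
```

As I understand it, the Princeton WordNet index files store lemmas in lower case, which would explain a zero count for eng regardless of #43.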
But what should we expect for the other languages? It's not even the case that each language has the same number of lemmas, so I don't know why we'd expect them to have the same number of lemmas starting with upper-case ASCII letters (if that is in fact what was expected). Also there is some normalization that occurs during lookup, so you might not get back the same case as was queried:
```python
>>> wn.synsets('bob_dylan')[0].lemma_names()
['Dylan', 'Bob_Dylan']
>>> wn.synsets('Δημοκρατία', lang='ell')[0].lemma_names(lang='ell')
['δημοκρατία']
```
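As far as I can tell, that normalization happens because lookup lower-cases the query before consulting the index, so differently-cased queries resolve to the same synsets. A minimal check (the True output is my expectation under that assumption):

```python
>>> wn.synsets('BOB_DYLAN') == wn.synsets('bob_dylan')
True
```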
Below I print out the language code, the count of all lemmas, the count of lemmas starting with upper-case (using str.isupper() instead of a regex match to make it unicode-capable), and a sample of up to 5 entries:
```python
>>> for lang in sorted(wn.langs()):
...     names = set(wn.all_lemma_names(lang=lang))
...     uppers = [name for name in wn.all_lemma_names(lang=lang) if name[0].isupper()]
...     print(f'{lang} {len(names):>8} {len(uppers):>6} {uppers[:5]}')
...
als     5988     14 ['John_Barleycorn', 'Lëndim', 'Pu', 'P', 'Lojë_me_pasa']
arb    17785      6 ['NSAID', 'I', 'Pb', 'S', 'HN']
bul     6720     10 ['Западът', ''', 'Библия', 'Южни_Щати', 'Коран']
cat    46531   4156 ['San_Angelo', 'Saint_Lucia', 'Boris_Karloff', 'Bob_Woodward', 'Rudyard_Kipling']
cmn    61532     59 ['B电池', 'T型发动机小汽车', 'T形台', 'CD', 'C电池']
dan     4468      3 ['T-shirt', 'Guds_hus', 'TOP']
ell    18225   1459 ['Κίνγκστον', 'Ουζμπεκιστάν', 'Αγκόλα', 'Νότιες_Αμερικανικές_χώρες', 'Δημοκρατία_της_Βολιβίας']
eng   147306      0 []
eus    26240    722 ['Abruzzi', 'IRA', 'Demostenes', 'Errumania', 'Job']
fas    17560      0 []
fin   129839  30876 ['San_Angelo', 'Saint_Lucia', 'Bromus_arvensis', 'Coluber-suku', 'Rasht']
fra    55350  12056 ['Apogonidae', 'Lydia_Kamakaeha', 'Central_Intelligence_Agency', 'Sous-unités_du_degré', "Liste_d'historiens_de_la_Bretagne"]
glg    23124   4278 ['San_Angelo', 'Boris_Karloff', 'Bob_Woodward', 'IRA', 'Rudyard_Kipling']
heb     5325      2 ['GAP!', 'PSEUDOGAP!']
hrv    29010   2545 ['Amazonke', 'Rom', 'Logički_sklop_NI', 'IRA', 'Phyllitis_scolopendrium']
ind    36954  15072 ['San_Angelo', 'Saint_Lucia', 'Bahasa_Mandarin', 'Bangsa_Keltik', 'Abruzzi']
ita    41855   2416 ['Bromus_arvensis', 'IRA', 'Phyllitis_scolopendrium', 'K', 'Phalaropus_fulicarius']
jpn    89637    238 ['K', 'CoA', 'JAVA', 'Eメイル+する', 'AB型']
nld    43077   3533 ['Saint_Lucia', 'Noord-Afrika', 'Romeinse_Rijk', 'Latijn', 'Spaanse_ras_van_kleine_paarden']
nno     3387      1 ['Internett']
nob     4186      3 ['T-skjorte', 'Guds_hus', 'Internett']
pol    45387   2512 ['Układ_Słoneczny', 'Afryka_Wschodnia', 'Rom', 'Stworzyciel', 'Cycero']
por    54069  22127 ['San_Angelo', 'Cordilheira_Australiana', 'Acido_nitrico', 'Apogonidae', 'Casa_de_iorque']
qcn     3206      0 []
slv    41032   3123 ['Beli_Nil', 'Eolija', 'Rudyard_Kipling', 'Phyllitis_scolopendrium', 'K']
spa    36681   6495 ['San_Angelo', 'Bairdiella', 'Boris_Karloff', 'Bob_Woodward', 'Rudyard_Kipling']
swe     5824      5 ['Venus', 'PIN-kod', 'Europa', 'TV', 'Mars']
tha    80508      0 []
zsm    33932  11193 ['San_Angelo', 'Saint_Lucia', 'Bahasa_Mandarin', 'Abruzzi', 'Rom']
```
Except perhaps for the English case, I'm not sure what the issue is here, and I suggest we close it.
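If the raw counts are the concern, a fairer cross-language comparison might be the share of capitalised lemmas rather than the absolute number, since the lemma totals above differ by an order of magnitude between languages. A minimal sketch building on the loop above (output omitted):

```python
>>> for lang in sorted(wn.langs()):
...     names = set(wn.all_lemma_names(lang=lang))
...     upper = sum(1 for name in names if name[:1].isupper())
...     print(f'{lang} {upper / len(names):.2%}')
```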