fd-dictionaries
Common ontology for part of speech
Some dictionaries already have some kind of local ontology to reliably identify
the part of speech (and potentially gender, etc.). Examples are the WikDict
dictionaries or eng-pol. Most other dictionaries lack this information; there,
the <pos/> tag may contain arbitrary text. For machine-friendly
postprocessing, this should be mapped to an ontology that is valid for all
FreeDict dictionaries.
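As a rough illustration of what "mapped to an ontology" could mean in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `POS_MAP` table, the `normalize_pos` helper, and the choice of target labels (the Universal Dependencies UPOS tag names, used only as placeholders). It is not an agreed FreeDict ontology.

```python
# Hypothetical mapping from free-text <pos> values to shared labels.
# The target labels are UPOS tag names, picked purely for illustration.
POS_MAP = {
    "n": "NOUN",
    "s": "NOUN",     # some dictionaries write "s" (substantive) for nouns
    "noun": "NOUN",
    "v": "VERB",
    "verb": "VERB",
    "adj": "ADJ",
    "adv": "ADV",
}


def normalize_pos(raw):
    """Return the shared label for a raw <pos> string, or None if unmapped."""
    # Ignore surrounding whitespace, letter case and a trailing dot ("N." -> "n").
    key = raw.strip().lower().rstrip(".")
    return POS_MAP.get(key)


if __name__ == "__main__":
    for raw in ["n", "s", " Adj. ", "phrasal thing"]:
        print(repr(raw), "->", normalize_pos(raw))
```

Returning `None` for unknown labels, rather than guessing, keeps the unmapped values visible so that a human can decide how to handle them.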
Things to happen:
- [ ] provide common ontology
- [ ] mention in documentation that newly imported / created dictionaries need to use the ontology
- [ ] convert existing dictionaries
I'm sorry I've missed this note. Providing a common taxonomy / ontology even for the current set of databases is a formidable task, and mostly linguistic, at the core.
It would be much more practical to use an existing taxonomy. Back when I created the tagUsage mechanism for aggregating grammatical information, ISOCat was probably the hype (not an ontology, just a messy set of potentially orderly taxonomic groupings). But ISOCat is gone now, replaced by a proprietary engine aiming at something slightly different than our goals.
Another viable option back then was the so-called GOLD ontology, created on the basis of a single comprehensive linguistic monograph, with (as far as I can recall, and this may be a false recollection) additions from various indigenous languages, coming from field workers. GOLD is not very alive nowadays, I'm afraid.
Somewhere along the way was/is the OLiA ontology, whose main mover is still very alive and kicking, so this could be worth exploring.
OR, something that has come to my mind right now and need not be the best solution for our goals, is the so-called universal tagset used by Universal Dependencies. The idealized picture would be to use each (non-universal) language-specific UD tagset and provide the (UD-supplied) mapping to the universal tagset. I can imagine two troubles with that:
- there is often no single language-specific tagset for the particular language in the UD approach; this is because UD datasets come from numerous corpora, and each of those corpora tends to use its own tagset (sometimes standardized at the, say, 'national level', like STTS for German or CLAWS for English; except note that CLAWS comes in several variants, and many corpora of English do not use CLAWS :-)).
- dictionary makers will only extremely rarely follow a corpus-based tagset, which would mean an extra step of aligning the PoS labels from the given dictionary with the PoS labels from the given corpus tagset.
Well, then... OLiA might be the only viable solution, currently.
Maybe having a common ontology is overkill for our project. But a dictionary should list the used PoS's in its header so that people know what "s" is, because it might have been encoded as "n" somewhere else.
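To list the PoS labels a dictionary actually uses, one first has to collect them. The following is a minimal sketch, assuming the dictionary is a TEI file in the standard TEI namespace with the labels in `<pos>` elements; the `pos_inventory` helper and the `SAMPLE` document are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace, in ElementTree's "{uri}" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Tiny made-up dictionary fragment, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry><form><orth>house</orth></form><gramGrp><pos>n</pos></gramGrp></entry>
<entry><form><orth>run</orth></form><gramGrp><pos>v</pos></gramGrp></entry>
<entry><form><orth>dog</orth></form><gramGrp><pos>n</pos></gramGrp></entry>
</body></text></TEI>"""


def pos_inventory(tei_xml):
    """Count every distinct, non-empty <pos> text value in a TEI document string."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for pos in root.iter(TEI_NS + "pos"):
        label = (pos.text or "").strip()
        if label:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    for label, count in sorted(pos_inventory(SAMPLE).items()):
        print(label, count)
```

The resulting inventory could then be written into the header by hand or by a script, together with a short explanation of each label.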