Standardized POS/Features with Universal Dependencies
Hey folks – opening a parent issue to openly discuss something that has been kicking around the back of my mind lately. Currently, to achieve maximum flexibility, Sense blocks (along with some others, like Form) allow you to pass <tag> children to denote any modifiers of the lexical component. For example, you may pass <tag>female</tag> for the sense "actrice" in French.
However, I'm thinking that, for the sake of clarity, it may be worth explicitly allowing attributes on these blocks instead, using UD's official feature list as the standard for their names and values.
For example, <sense gender="masc" definite="spec" /> to denote gender and definiteness. In a similar vein, it would be nice to leverage UD's official POS tags for ODict's PartOfSpeech, though the POS tags supported by ODict are currently much more expansive than what UD offers (such as the many Japanese-specific part-of-speech tags).
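To illustrate what attribute-based features might buy us, here is a rough sketch of validating a sense against a fixed list of values. It is not ODict code: the `UD_FEATURES` table and the `validate_sense` helper are hypothetical, the table covers only a small subset of UD's Gender and Definite values, and the lowercase spelling of names and values is an assumption taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: a small subset of UD feature values,
# lowercased to match the attribute style proposed in this issue.
UD_FEATURES = {
    "gender": {"masc", "fem", "neut", "com"},
    "definite": {"ind", "spec", "def", "cons", "com"},
}


def validate_sense(xml_text):
    """Return error messages for attributes that are not known features or values."""
    sense = ET.fromstring(xml_text)
    errors = []
    for name, value in sense.attrib.items():
        allowed = UD_FEATURES.get(name)
        if allowed is None:
            errors.append(f"unknown feature: {name}")
        elif value.lower() not in allowed:
            errors.append(f"invalid value for {name}: {value}")
    return errors


if __name__ == "__main__":
    print(validate_sense('<sense gender="masc" definite="spec" />'))
    print(validate_sense('<sense gender="m" />'))
```

With a closed set of attributes like this, a parser could reject unknown names or values up front, which is much harder to do with free-form <tag> children.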
Wondering what everyone's thoughts are on this and whether it is worth the change!
"masc" seems too much. Why not a single letter (m, f, mf, n, p)?
Also, this question is more generic, but it still applies to genders and tags. In Russian, for instance, and in other languages with non-Latin alphabets, how would you deal with native POS/gender? I mean, the "noun" POS cannot/shouldn't be used when the destination language is not English (I write "cannot/shouldn't", but that's only one opinion; I am only asking for guidance here :).
I think this is a wonderful idea. Perhaps we should have a more permissive schema (perhaps using an XML namespace?)
<sense pos="n">
<feature name="ud:Gender" value="Masc" />
</sense>
Many of the special Japanese POS tags already fit this framework, and the rest are just specializations of the above and can be encoded like the id proposal in #1303.
<sense pos="v">
<feature name="ja:Dan" value="5">Godan</feature>
</sense>
Hey folks! Sorry, I've been busy the past couple of weeks – just returning to this.
@BoboTiG The reason for choosing "masc", for example, is that it matches the values provided by UD's specification. The idea here would be to align the values used in ODict with those in UD, keeping ODict in line with existing standards.
@Waelwindows Hmm... I think my hesitance around keeping things too flexible is that, for dictionary clients, how to present this data becomes somewhat ambiguous. For example, if the name attribute of feature is a free-form string, how would a client reasonably distinguish between something describing a gender and something describing a part of speech? Also, in the examples you provided, the ud specifier describes a POS framework in one case, whereas ja describes a language in the other. I worry it might be too ambiguous here. What are your thoughts?
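To make the ambiguity concrete, here is a minimal sketch of what a client might have to do with prefixed feature names. Everything in it is an assumption for illustration: the `KNOWN_PREFIXES` registry and the `read_features` helper are hypothetical and not part of ODict.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: the client has to know out of band what each
# prefix stands for, since the XML itself does not say.
KNOWN_PREFIXES = {"ud": "framework", "ja": "language"}


def read_features(xml_text):
    """Return (prefix kind, feature key, value) for each <feature> child."""
    sense = ET.fromstring(xml_text)
    features = []
    for feature in sense.findall("feature"):
        # Split "ud:Gender" into prefix "ud" and key "Gender".
        prefix, _, key = feature.get("name", "").partition(":")
        kind = KNOWN_PREFIXES.get(prefix, "unknown")
        features.append((kind, key, feature.get("value")))
    return features


if __name__ == "__main__":
    print(read_features(
        '<sense pos="n"><feature name="ud:Gender" value="Masc" /></sense>'
    ))
    print(read_features(
        '<sense pos="v"><feature name="ja:Dan" value="5">Godan</feature></sense>'
    ))
```

Even with the prefix split out, the client still only sees strings like "Gender" and "Dan". Without a shared registry it cannot tell which of them to render as a gender, a conjugation class, or something else.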