Discuss POS
Current POS types don't suit all situations. When converting from other dictionary formats, I often encounter POS types that are not in ODict, or that I cannot decide how to map. WordNet has only 10 types:
<!-- The part of speech values are as follows:
n: Noun
v: Verb
a: Adjective
r: Adverb
s: Adjective Satellite
z: Multiword expression (inc. phrase, idiom)
c: Conjunction
p: Adposition (Preposition, postposition, etc.)
x: Other (inc. particle, classifier, bound morpheme, determiner)
u: Unknown -->
I don't know how many types Wiktionary has, but it is definitely more than 10. In ODict, we have over 100 types, but they still can't cover every situation in conversion. Adding more types cannot solve this problem, because there is no standard for POS. Different dictionaries use different POS types, not to mention different languages.
WordNet is for NLP, so I believe it has a reasonable minimal set of POS, and I think its Other type is perfect for our problem here. Maybe we can add a new fallback enum variant other(&str) and map all non-standard POS to it, retaining the POS information instead of mapping to unknown. But I don't know if it's easy to represent this type in bindings other than Rust. This is the first thing to discuss.
Second, I found some POS missing during Wiktionary conversion. Please check whether my mapping is correct, and decide whether to add some of them to ODict. For example, I think "stem" is reasonable to add, because we already have "affix" and they are related concepts for compounds.
@@ -25,6 +30,10 @@ pos_map = {
"article": "art",
"character": "chr",
"circumfix": "cf",
+ "circumpos": "cf",
+ "classifier": "cls",
+ "combining_form": "un", # should be stem, but we don't have the type
+ "contraction": "contr",
"infix": "inf",
"interfix": "intf",
"noun": "n",
@@ -32,10 +41,19 @@ pos_map = {
"phrase": "phr",
"prefix": "pref",
"prep_phrase": "phr_prep",
+ "proverb": "prov",
"punct": "punc",
"suffix": "suff",
"symbol": "sym",
"verb": "v",
+
+ # found in jpn
+ "adnominal": "adj_pn",
+ "counter": "ctr",
+ "romanization": "un", # TBD
+ "root": "un",
+ "soft-redirect": "un",
+ "syllable": "un",
}
That's all for now, I'll add more about POS if I find more later.
Hey @jaxvanyang! Thanks for raising this. I agree, and it's something I've deliberated over, especially with the recently added FormKind and PronunciationKind enums, which face a similar dilemma. I do think the most idiomatic way (short of just allowing any kind of raw string) is to use an Other(String) enum variant, but as you said, this doesn't really map well to other languages that don't offer similar features.
Maybe we could try having POS (and similar enums) be a union of <Enum>|string in other languages? Or is that too confusing? Happy to hear your thoughts!
In the meantime, will work to get these new POS added.
From what I see in the converter, the Python binding doesn't really use the enum. POS is just a string in the current Python binding, and it actually accepts any value: the dict object is first converted to XML, and XML accepts any value. But then the conversion to ODict checks the value and fails. Given this, I think keeping POS as a string in other languages is a good idea. We just need to add new logic to the XML to ODict conversion in Rust, and the other languages will automatically benefit from it. What do you think?
EDIT: I think we should also provide a way to check whether the value is standard, like a pair of functions (or methods) is_other and is_standard.
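A minimal sketch of what that pair of checks could look like if POS stays a plain string. This is only an illustration: `STANDARD_POS` is a hypothetical name, and the tags in it are a small subset borrowed from the mapping earlier in this thread, not the full ODict list.

```python
# Hypothetical, illustrative subset of standard tags; the real ODict list is much longer.
STANDARD_POS = frozenset({"n", "v", "art", "phr", "pref", "suff", "sym", "un"})


def is_standard(pos: str) -> bool:
    """Return True if `pos` is one of the known (standard) tags."""
    return pos in STANDARD_POS


def is_other(pos: str) -> bool:
    """Return True if `pos` is a custom tag outside the standard set."""
    return not is_standard(pos)
```

With this, a converter could keep a value like "stem" as-is and still tell it apart from the built-in tags, instead of collapsing it to "un".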
Yeah.. I was planning to change that aspect of the Python API 😅 As I mentioned on your PR I think, I do think using enums in unions could be good here. Also, I realize my last comment didn't format correctly.
I was suggesting we do something like this:
pos: Union[PartOfSpeech, str]
where PartOfSpeech is a ported Python enum of all of the Rust POS values, and you can check to see if something is an official POS tag via:
isinstance(value, PartOfSpeech):
otherwise you can assume it's just a string of a custom value. Can do something similar in Node too, seeing JS also isn't statically typed and TS supports unions too.
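To make the union idea concrete, here is a rough sketch. It is not the current binding: the enum members are a hypothetical subset, and `Pos`, `parse_pos`, and `is_official` are made-up helper names.

```python
from enum import Enum
from typing import Union


class PartOfSpeech(Enum):
    # Hypothetical subset; values reuse tags from the mapping above.
    NOUN = "n"
    VERB = "v"
    PHRASE = "phr"
    UNKNOWN = "un"


# A POS value is either an official enum member or a custom raw string.
Pos = Union[PartOfSpeech, str]


def parse_pos(raw: str) -> Pos:
    """Return the enum member for a known tag, else the raw string unchanged."""
    try:
        return PartOfSpeech(raw)  # Enum lookup by value raises ValueError if unknown
    except ValueError:
        return raw


def is_official(value: Pos) -> bool:
    """True for official POS tags, False for custom string values."""
    return isinstance(value, PartOfSpeech)
```

The trade-off is that callers have to handle two types, but custom values survive a round trip without being mapped to unknown.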
Actually, according to the PyO3 website, they do support tuple enum variants, so doing an Other(String) variant in Python at least seems entirely doable. No idea yet what accessing this from the Python side would look like though, seeing I don't think Python as a language supports this concept. Gotta do some reading I guess.
I forgot to mention that keeping it a string is good because we don't have to change anything in Python. The union way is definitely better, while the native enum variant is the best for API consistency. I think they are all good. But either way, we have to implement the Rust side first. Maybe we can draft a PR to proceed and test the ideas.
For sure – I think the Rust side is definitely doable in the interim and could get started on a PR, unless you wanted to.
Though I'm inclined, in that case, to make PronunciationKind and FormKind strings in the other languages too, otherwise it'd be weird for some to be enums and some to not be. Curious if we kept with a string approach how we could enforce validity, like you raised. Another option is to have JS and Python use wrapper classes, something like this:
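The wrapper idea could be sketched roughly as follows. This is only an illustration under assumptions: `PosTag`, `_STANDARD_TAGS`, and the property names are hypothetical and not an existing ODict API, and the tag set is a small subset taken from the mapping earlier in the thread.

```python
from dataclasses import dataclass

# Hypothetical, illustrative subset of standard tags (not the full ODict list).
_STANDARD_TAGS = frozenset({"n", "v", "art", "phr", "pref", "suff", "sym", "un"})


@dataclass(frozen=True)
class PosTag:
    """Wraps a raw POS string and reports whether it is a standard tag."""

    tag: str

    @property
    def is_standard(self) -> bool:
        return self.tag in _STANDARD_TAGS

    @property
    def is_other(self) -> bool:
        return not self.is_standard

    def __str__(self) -> str:
        # The wrapper serializes back to the raw tag it was built from.
        return self.tag
```

The same shape would work for PronunciationKind and FormKind, which would keep the three consistent across bindings while still accepting custom values.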
The string way cannot guarantee correctness. As I see it, it just relies on the user to keep values valid on a best-effort basis. I think the wrapper way is easier to understand, but as I said before, they are all good. It's up to you.
Got an initial PR open: https://github.com/TheOpenDictionary/odict/pull/1203