stanza
Wrong genders in Romanian
Describe the bug When neuter nouns in Romanian are analyzed, they are tagged as "Gender=Masc".
To Reproduce Analyze a sentence such as "Sistemul este foarte bun". The neuter noun "sistem" appears as:
Input sentence: Sistemul este foarte bun
[
  [
    {
      "id": 1,
      "text": "Sistemul",
      "lemma": "sistem",
      "upos": "NOUN",
      "xpos": "Ncmsry",
      "feats": "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing",
      "head": 4,
      "deprel": "nsubj",
      "start_char": 0,
      "end_char": 8
    },
    {
When I look for that word in the training data, it is labeled Gender=Masc in both of the bigger Romanian treebanks:
[john@localhost UD_Romanian-RRT]$ grep Sistemul *conllu | grep -v "# text"
ro_rrt-ud-test.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 2 nsubj _ _
ro_rrt-ud-train.conllu:29 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 27 nmod _ _
ro_rrt-ud-train.conllu:15 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 12 nmod _ _
ro_rrt-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj:pass _ _
ro_rrt-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj _ _
[john@localhost UD_Romanian-SiMoNERo]$ grep Sistemul *conllu | grep -v "# text"
ro_simonero-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 7 nsubj _ _
ro_simonero-ud-train.conllu:11 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 9 nmod _ _
ro_simonero-ud-train.conllu:29 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 27 obl:agent _ _
ro_simonero-ud-train.conllu:40 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 29 conj _ _
ro_simonero-ud-train.conllu:1 Sistemul sistem NOUN Ncmsry Case=Nom|Definite=Def|Gender=Masc|Number=Sing 3 nsubj _ _
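The same check can be sketched in Python for anyone who wants to tally the `Gender` values of a lemma instead of reading grep output. This is only a minimal sketch: the helper names `parse_feats` and `gender_counts` are made up for illustration, and the sample rows are three of the lines above re-joined with tabs (CoNLL-U columns are tab-separated) with the file-name prefix dropped.

```python
from collections import Counter

# Three rows copied from the grep output above, re-joined with tabs.
# The "file.conllu:" prefix that grep adds is dropped.
SAMPLE = "\n".join([
    "1\tSistemul\tsistem\tNOUN\tNcmsry\tCase=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_",
    "29\tSistemul\tsistem\tNOUN\tNcmsry\tCase=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing\t27\tnmod\t_\t_",
    "1\tSistemul\tsistem\tNOUN\tNcmsry\tCase=Nom|Definite=Def|Gender=Masc|Number=Sing\t7\tnsubj\t_\t_",
])


def parse_feats(feats):
    """Turn a FEATS string such as 'Gender=Masc|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def gender_counts(conllu_text, lemma):
    """Count Gender values over all token rows whose LEMMA column matches."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or cols[2] != lemma:
            continue  # not a 10-column token row, or a different lemma
        counts[parse_feats(cols[5]).get("Gender", "None")] += 1
    return counts


print(gender_counts(SAMPLE, "sistem"))
```

To run it on a real treebank file, read the file's text and pass it in place of `SAMPLE`.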
Dear John,
Many thanks for your fast reply. I almost suspected something like that.
BR,
/Jonny
Dear John, I found this in the Romanian UD documentation (https://universaldependencies.org/ro/index.html): the neuter gender is deliberately not used for nouns.
Nominal Features
- Nominal words (NOUN https://universaldependencies.org/u/pos/NOUN.html, PROPN https://universaldependencies.org/u/pos/PROPN.html and PRON https://universaldependencies.org/u/pos/PRON.html) have an inherent Gender https://universaldependencies.org/u/feat/Gender.html feature with one of two values: Masc or Fem. The neuter is in Romanian classified as masculine singular and feminine plural.
- The following parts of speech inflect for Gender because they must agree with nouns: ADJ https://universaldependencies.org/u/pos/ADJ.html, DET https://universaldependencies.org/u/pos/DET.html, NUM https://universaldependencies.org/u/pos/NUM.html, VERB https://universaldependencies.org/u/pos/VERB.html, AUX https://universaldependencies.org/u/pos/AUX_.html. For verbs (including auxiliaries), only participles have gender.
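Given that convention, a downstream user who needs the neuter could try to recover it heuristically: a noun lemma that shows up as `Gender=Masc` in the singular and `Gender=Fem` in the plural is a candidate neuter. The sketch below illustrates the idea only. It is not part of Stanza or UD, `probable_neuters` is a hypothetical helper, and every row except the first is invented for the example rather than taken from a treebank.

```python
from collections import defaultdict


def parse_feats(feats):
    """Turn a FEATS string such as 'Gender=Masc|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def probable_neuters(rows):
    """Return noun lemmas seen as Masc in the singular and Fem in the plural.

    rows is an iterable of (lemma, upos, feats) tuples.
    """
    seen = defaultdict(set)
    for lemma, upos, feats in rows:
        if upos != "NOUN":
            continue
        f = parse_feats(feats)
        if "Gender" in f and "Number" in f:
            seen[lemma].add((f["Gender"], f["Number"]))
    return sorted(
        lemma
        for lemma, pairs in seen.items()
        if ("Masc", "Sing") in pairs and ("Fem", "Plur") in pairs
    )


ROWS = [
    # First row: features as reported in this thread for "Sistemul".
    ("sistem", "NOUN", "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing"),
    # Remaining rows are hypothetical, written to follow the stated convention.
    ("sistem", "NOUN", "Case=Acc,Nom|Definite=Def|Gender=Fem|Number=Plur"),
    ("băiat", "NOUN", "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing"),
    ("băiat", "NOUN", "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Plur"),
]

print(probable_neuters(ROWS))
```

The heuristic needs both a singular and a plural form of the lemma in the data, and homonymous lemmas with different genders would produce false positives, so the result is best treated as a candidate list.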
Is the conclusion that there's nothing to be done? It would basically require an overhaul of the dataset or a special case of some kind for Romanian
No, there is nothing to do, since it seems to be a conscious decision not to have a neuter gender in UD.
Thanks for the reminder,
/Jonny