Wrong genders in Romanian

Open jonnyGitHub57 opened this issue 10 months ago • 5 comments

Describe the bug: When Romanian text is analyzed, neuter nouns are tagged as "Gender=Masc".

To Reproduce Analyze a sentence such as "Sistemul este foarte bun". The neuter noun "sistem" appears as:

Input sentence: Sistemul este foarte bun

[
  [
    {
      "id": 1,
      "text": "Sistemul",
      "lemma": "sistem",
      "upos": "NOUN",
      "xpos": "Ncmsry",
      "feats": "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing",
      "head": 4,
      "deprel": "nsubj",
      "start_char": 0,
      "end_char": 8
    },
    {
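For anyone who wants to check this on their own sentences, here is a minimal sketch of how the Gender value can be read out of the `feats` string. The helper names `parse_feats` and `genders` are made up for this example. `genders` assumes Stanza is installed and the default Romanian models have already been downloaded; it is not called here, so the snippet runs without them.

```python
def parse_feats(feats):
    """Turn a UD feats string such as 'Case=Nom|Gender=Masc' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def genders(sentence_text, lang="ro"):
    """Run Stanza on the text and return (text, lemma, Gender) for each word.

    Assumption: the default models for `lang` were fetched beforehand,
    e.g. with stanza.download("ro").
    """
    import stanza  # imported lazily so the helper above works without Stanza

    nlp = stanza.Pipeline(lang)
    doc = nlp(sentence_text)
    return [
        (word.text, word.lemma, parse_feats(word.feats).get("Gender"))
        for sentence in doc.sentences
        for word in sentence.words
    ]


# The feats string reported in this issue for "Sistemul":
reported = parse_feats("Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing")
print(reported["Gender"])
```

Calling `genders("Sistemul este foarte bun")` should give one tuple per word, with `None` as the gender for words that carry no Gender feature.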

jonnyGitHub57 avatar Jan 19 '25 16:01 jonnyGitHub57

When I look for that word in the training data, it is labeled Gender=Masc in both of the bigger Romanian treebanks:

[john@localhost UD_Romanian-RRT]$ grep Sistemul  *conllu | grep -v "# text"
ro_rrt-ud-test.conllu:1 Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       2       nsubj   _       _
ro_rrt-ud-train.conllu:29       Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       27      nmod    _       _
ro_rrt-ud-train.conllu:15       Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       12      nmod    _       _
ro_rrt-ud-train.conllu:1        Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       7       nsubj:pass      _      _
ro_rrt-ud-train.conllu:1        Sistemul        sistem  NOUN    Ncmsry  Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing       7       nsubj   _       _
[john@localhost UD_Romanian-SiMoNERo]$ grep Sistemul  *conllu | grep -v "# text"
ro_simonero-ud-train.conllu:1   Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   7       nsubj   _       _
ro_simonero-ud-train.conllu:11  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   9       nmod    _       _
ro_simonero-ud-train.conllu:29  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   27      obl:agent       _       _
ro_simonero-ud-train.conllu:40  Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   29      conj    _       _
ro_simonero-ud-train.conllu:1   Sistemul        sistem  NOUN    Ncmsry  Case=Nom|Definite=Def|Gender=Masc|Number=Sing   3       nsubj   _       _
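The same check can be sketched in Python, assuming standard 10-column, tab-separated CoNLL-U rows. The function name `gender_counts` and the `SAMPLE` string are only for illustration; the two sample rows are copied from the grep output above with the file-name prefix removed.

```python
from collections import Counter

# Two rows from the grep output above (one RRT, one SiMoNERo), tab-separated.
SAMPLE = (
    "1\tSistemul\tsistem\tNOUN\tNcmsry\t"
    "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_\n"
    "29\tSistemul\tsistem\tNOUN\tNcmsry\t"
    "Case=Nom|Definite=Def|Gender=Masc|Number=Sing\t27\tobl:agent\t_\t_\n"
)


def gender_counts(conllu_text, form):
    """Count the Gender values annotated for every token whose FORM equals `form`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or cols[1] != form:
            continue
        feats = dict(f.split("=", 1) for f in cols[5].split("|") if "=" in f)
        counts[feats.get("Gender", "(none)")] += 1
    return counts


print(gender_counts(SAMPLE, "Sistemul"))
```

To run it over a whole treebank, read the `.conllu` file into a string and pass that in place of `SAMPLE`.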

AngledLuffa avatar Jan 19 '25 23:01 AngledLuffa

Dear John,

Many thanks for your fast reply. I almost suspected something like that.

BR,

/Jonny

jonnyGitHub57 avatar Jan 20 '25 08:01 jonnyGitHub57

Dear John, I found this information on the Romanian UD page, https://universaldependencies.org/ro/index.html: it is on purpose that the neuter gender is not used for nouns (see "Nominal Features").

jonnyGitHub57 avatar Jan 20 '25 09:01 jonnyGitHub57

Is the conclusion that there's nothing to be done? It would basically require an overhaul of the dataset or a special case of some kind for Romanian
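For illustration only, a "special case" on the user side could be a post-processing step along these lines. This is a rough sketch and not something Stanza provides: `NEUTER_LEMMAS` is a hypothetical list the user would have to maintain, `relabel_neuter` is a made-up helper, and the rewritten output would no longer match the Romanian UD annotation.

```python
# Hypothetical, user-maintained set of lemmas that the user wants labeled neuter.
NEUTER_LEMMAS = {"sistem"}


def relabel_neuter(lemma, upos, feats, neuter_lemmas=NEUTER_LEMMAS):
    """Return feats with Gender rewritten to Neut for listed noun lemmas.

    Anything that is not a listed NOUN lemma is returned unchanged.
    """
    if upos != "NOUN" or lemma not in neuter_lemmas or not feats:
        return feats
    parts = [
        "Gender=Neut" if part.startswith("Gender=") else part
        for part in feats.split("|")
    ]
    return "|".join(parts)


print(relabel_neuter(
    "sistem", "NOUN", "Case=Acc,Nom|Definite=Def|Gender=Masc|Number=Sing"
))
```

The function only touches the `feats` string, so it can be applied to the `lemma`, `upos` and `feats` fields of each word in the tagger output.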

AngledLuffa avatar Mar 24 '25 21:03 AngledLuffa

No, there is nothing to do, since it seems to be a conscious decision not to use the neuter gender in the Romanian UD.

Thanks for the reminder,

/Jonny

jonnyGitHub57 avatar Mar 25 '25 19:03 jonnyGitHub57