stanza
German ordinal numbers leading to split sentences
Describe the bug
In German, ordinal numbers have a dot after the number:
- On the 1st day --> Am 1. Tag
- The 23rd item --> Der 23. Eintrag
Furthermore, the day-of-the-month part of a date is always an ordinal number:
- September 30 --> Der 30. September (i.e. "the thirtieth September")
It seems that under some circumstances, the default German model of Stanza is confused by the ordinal number's dot:
Am 20. März 1905 wurde der FC Chelsea gegründet.
This sentence leads to the words [['am', '20', '.'], ['März', '1905', 'wurde', 'der', 'FC', 'Chelsea', 'gegründet']], i.e. Stanza detects two sentences, neither of which makes sense on its own.
Am 21. Juni beginnt der Sommer.
This sentence sometimes leads to the words [['Am', '21', '.'], ['Juni', 'beginnt', 'der', 'Sommer']], and sometimes to the correct result (just one sentence). I am not sure whether the Stanza model is deterministic or whether some random factor is involved.
To Reproduce
Steps to reproduce the behavior:
import stanza
stanza.download('de')
nlp = stanza.Pipeline('de')
nlp('Am 20. März 1905 wurde der FC Chelsea gegründet.')
nlp('Am 21. Juni beginnt der Sommer.')
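To make the split visible, the tokens of each detected sentence can be printed. The following is only a sketch: the helper names `sentence_tokens` and `check_single_sentence` are made up for this illustration, and it assumes the German models have already been downloaded as in the steps above.

```python
from typing import Any, List


def sentence_tokens(doc: Any) -> List[List[str]]:
    """Return the surface tokens of each sentence in a Stanza document."""
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]


def check_single_sentence(text: str) -> None:
    """Print how many sentences Stanza finds in `text`, and their tokens."""
    # Imported lazily so sentence_tokens() can be used without Stanza installed.
    import stanza

    # Assumes stanza.download('de') has already been run.
    nlp = stanza.Pipeline('de')
    tokens = sentence_tokens(nlp(text))
    print(len(tokens), tokens)
```

Calling `check_single_sentence('Am 20. März 1905 wurde der FC Chelsea gegründet.')` should print a sentence count of 1 if the sentence is segmented as expected.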
Expected behavior
Since "Am 20. März 1905 wurde der FC Chelsea gegründet." is one correct German sentence, we expect Stanza to return the POS tags for just one sentence, instead of two.
Environment (please complete the following information):
- OS: Ubuntu 20.04.1 under Windows 10 (WSL).
- Python version: 3.8.5 under virtualenv
- Stanza version: 1.2
Would it be useful to add a multi-word tokenization here? There could be a way to ensure that number and dot coexist as one token.
MWT is an interesting idea, but it's quite different from the existing annotation standard in GSD. For dates, GSD keeps the number and the dot as one token, and for list items it splits them into two tokens.
Overall I would have to say the problem is that a . followed by a space is the end of a sentence a huge portion of the time in the training data. Characters seem to be insufficient for reliably distinguishing periods which can continue a sentence from periods which can't. One possibility would be a model which looks at more of the surrounding word context, not just the characters.
There are other scenarios where a dot . with a space is not the end of a sentence, e.g. abbreviations ("U.S. Marshal", "at 4 p.m. or later"). If Stanza is able to parse those, then it seems you have enough of them in the training data.
Why not just add some German sentences with ordinal numbers to the training data?
None of us speak German
I can gladly provide some examples. How many do you need, approximately?
I assume you already have some German sentences, otherwise you could not train a German model. So, you only need some additional examples with ordinal numbers, right?
Yeah, the current default German dataset is here:
https://github.com/UniversalDependencies/UD_German-GSD
There are 23 examples of the regex [0-9][.] that do not end a sentence, and a few hundred that do, so you see why it would frequently make the wrong choice. I can't really pretend to know the right answer, but it's going to be more than a handful and less than War & Peace. Probably one for each numeral and one for each day / month is a good start, although we might actually need the product rather than the sum of those numbers. Autogenerating sentences is probably bad, as we might accidentally introduce some biases where it only handles the new splits correctly in the exact same context as the autogenerated sentences. It doesn't need a full pos/dep/etc treatment; we can train the tokenizer separately from the rest. Anyway, the point is I'm happy to rebuild models and report the results a few times as you come up with new text.
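A rough way to count such cases in a CoNLL-U file is sketched below. This is not the script used to get the numbers above; the function names are hypothetical, and it only looks at the `# text = ...` comment lines, counting a digit followed by a dot as sentence-internal when whitespace follows and as sentence-final when it is the last thing in the sentence.

```python
import re
from typing import Iterable, List, Tuple

# A digit followed by a dot, which is in turn followed by whitespace or the end.
DIGIT_DOT = re.compile(r"[0-9][.](?=\s|$)")


def sentence_texts(conllu: str) -> List[str]:
    """Collect the raw sentence texts from the '# text = ...' comment lines."""
    prefix = "# text = "
    return [line[len(prefix):] for line in conllu.splitlines() if line.startswith(prefix)]


def count_digit_dot(sentences: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_internal, sentence_final) counts of the digit-dot pattern."""
    internal = 0
    final = 0
    for sentence in sentences:
        text = sentence.strip()
        for match in DIGIT_DOT.finditer(text):
            if match.end() == len(text):
                final += 1
            else:
                internal += 1
    return internal, final
```

For a downloaded treebank file this would be used as `count_digit_dot(sentence_texts(open(path, encoding="utf-8").read()))`, where `path` is wherever the file was saved.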
When you say, it does not need the full "pos/dep/etc" treatment, what does that mean exactly? How much of the following data do I need to supply?

None of it. All we need is the tokenization and any possible MWT.
Just to be sure, for the example above, "Manasse ist ein einzigartiger Parfümeur.", you just need
Manasse
ist
ein
einzigartiger
Parfümeur
.
For Multi-Word Tokens (i.e. several words packed into one), you just need an expansion, i.e. the individual words?
correct
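As an illustration of "tokenization only", a token list can be written out as CoNLL-U rows with every annotation column left as an underscore. This is a sketch under assumptions: `tokens_to_conllu` is a hypothetical helper, it ignores MWT ranges and SpaceAfter=No, and whether such a file is directly usable for training has not been confirmed in this thread.

```python
from typing import List


def tokens_to_conllu(text: str, tokens: List[str]) -> str:
    """Render one sentence as CoNLL-U with only ID and FORM filled in."""
    lines = [f"# text = {text}"]
    for index, form in enumerate(tokens, start=1):
        # CoNLL-U has 10 tab-separated columns; the last 8 stay unannotated.
        lines.append("\t".join([str(index), form] + ["_"] * 8))
    # Sentences in CoNLL-U are terminated by a blank line.
    return "\n".join(lines) + "\n\n"


print(tokens_to_conllu(
    "Manasse ist ein einzigartiger Parfümeur.",
    ["Manasse", "ist", "ein", "einzigartiger", "Parfümeur", "."],
))
```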
And, the correct tokenization for an ordinal number is to leave the dot next to the number, right?
Am
23.
Oktober
habe
ich
Geburtstag
.
Is 23. an MWT, because it is composed of the number and the dot? Or should it be treated as a single token?
Here are my results so far. Tokenization is mostly trivial (split spaces and commas, don't split dots).
In what format should I submit them?
Do we need much more?
Please note, I'm a native German speaker, but I cannot guarantee that the spelling and punctuation are 100% correct.
Am 23. Oktober habe ich Geburtstag.
Der 2. Weltkrieg begann am 1. September 1939.
Der 1. Mai ist der Tag der Arbeit.
Der 29. Februar existiert nur in einem Schaltjahr.
Am 2. Tag begannen sie mit der Arbeit.
Gott ruhte am 7. Tag.
Am 20. März 1905 wurde der FC Chelsea gegründet.
Am 21. Juni beginnt der Sommer.
Beim 12. Versuch waren wir zum 1. Mal erfolgreich.
Heiligabend ist am 24., Weihnachten am 25. und Stephanstag am 26. Dezember.
Kommst du am 14. März oder am 15.?
Kommst du am 14. oder am 15. März?
Am 6. Dezember ist St. Nikolaustag.
An einem Freitag den 13. würde ich im Lift nicht in den 13. Stock fahren.
Der 31. Oktober ist bekannt als Halloween.
Den 32. gibt es in keinem Monat.
Jeder 7. Tag ist ein Mittwoch.
Die Titanic startete am 10. April 1912 und am 15. sank sie schon.
Der 1. Januar ist der Anfang des Jahres.
Der 1. Offizier ist direkt dem Kapitän unterstellt.
Die 9. Symphonie Beethoven enthält die Melodie der Europahymne.
Den 77. Geburtstag feierten wir zusammen mit dem 50. Hochzeitstag.
Der 44. Präsident der USA hiess Barack Obama.
Am 5. jedes Monats machen wir Grosseinkauf.
Das 7. Weltwunder war der Tempel von Artemis.
Der 16. Jänner gilt als der kälteste Tag des Jahres.
Nach der 8. Tasse Kaffee habe ich genug.
Bis zum 17. Juli haben wir Schule, dann beginnen unsere Ferien.
Er versuchte es noch ein 18. Mal, bevor er aufgab.
Der 2. Tag der Woche ist Dienstag.
Der 19. September ist bei uns ein Feiertag.
Der 22. Februar im 3. Winter war der letzte kalte Tag.
Reinhold Messner war der 1. Mensch, der den Mount Everest ohne Sauerstoff bestieg.
Ich lese das Buch zum 4. Mal.
Um den 10. August herum sieht man viele Sternschnuppen.
Im November ist es oft kalt und nass, aber nach dem 25. hatten wir schönes Wetter.
Im 2. Semester wird es dann schwieriger.
Heute ist der 3. Juli.
Bis zum 11. Kapitel hielt ich durch.
Kaiser Karl V. herrschte über Spanien, Italien und das Deutsche Reich.
Franz II. war der letzte römisch-deutsche Kaiser.
Franz I. war zuerst Herzog von Lothringen, bevor er Kaiser des Heiligen Römischen Reiches wurde.
König Karl IV. von Ungarn wurde am 30. Juli 1918 zum letzten Kaiser von Österreich gekrönt.
Rudolf III. nannte man "den Schweigsamen".
Albrecht VII. von Habsburg war ein wichtiger Mäzen der Kunst seiner Zeit.
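The "mostly trivial" tokenization described above (split on spaces and commas, keep the dot attached to the number) could be sketched as follows. `tokenize_sentence` is a hypothetical helper written for this list of sentences, not part of Stanza. It assumes each input is exactly one sentence and treats only the last character as sentence-final punctuation, so a sentence that ends in an ordinal with a single shared dot would need manual review.

```python
from typing import List


def tokenize_sentence(sentence: str) -> List[str]:
    """Split one sentence into tokens, keeping dots on ordinals and abbreviations."""
    chunks = sentence.split()
    tokens: List[str] = []
    for index, chunk in enumerate(chunks):
        leading: List[str] = []
        trailing: List[str] = []
        # Only the very last character of the sentence is split off as
        # sentence-final punctuation; every other dot stays attached.
        if index == len(chunks) - 1 and chunk[-1] in ".?!":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Commas and double quotes become tokens of their own.
        while chunk and chunk[-1] in ',"':
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        while chunk and chunk[0] == '"':
            leading.append(chunk[0])
            chunk = chunk[1:]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(tokenize_sentence("Am 23. Oktober habe ich Geburtstag."))
print(tokenize_sentence("Kommst du am 14. März oder am 15.?"))
```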
I think you mostly need to follow along with the GSD standard by looking in the dataset.
https://github.com/UniversalDependencies/UD_German-GSD
For example, the following two examples. I'm not sure if the difference in lemma is a glitch or intentional usage based on the context.
1 5. 5. ADV ADV _ 2 dep _ FixTigerDep=Yes
2 Ich ich PRON PPER Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj _ _
3 bedaure bedauern VERB VVFIN Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 0 root _ _
4 das der DET ART Case=Acc|Definite=Def|Gender=Neut|Number=Sing|PronType=Art 6 det _ _
5 Karlsruher Karlsruher ADJ ADJA Case=Acc|Gender=Neut|Number=Sing 6 amod _ _
6 Urteil Urteil NOUN NN Case=Acc|Gender=Neut|Number=Sing 3 obj _ SpaceAfter=No
7 . . PUNCT $. _ 3 punct _ _
1 Der der DET ART Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art 2 det _ _
2 Aschermittwoch Aschermittwoch PROPN NN Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ NamedEntity=Yes
3 kommt kommen VERB VVFIN Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
4 bestimmt bestimmt ADJ ADJD _ 3 xcomp _ _
5 -- -- PUNCT $( _ 10 punct _ _
6-7 am _ _ _ _ _ _ _ _
6 an an ADP APPR _ 9 case _ _
7 dem der DET ART Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art 9 det _ _
8 21. 21 ADJ ADJA Case=Dat|Gender=Masc|Number=Sing 9 amod _ _
9 Februar Februar PROPN NN Case=Dat|Gender=Masc|Number=Sing 10 obl _ NamedEntity=Yes
10 ist sein VERB VAFIN Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 parataxis _ _
11 wieder wieder ADV ADV _ 10 advmod _ _
12 `` `` PUNCT $( _ 13 punct _ SpaceAfter=No
13 alles alle PRON PIS Case=Nom|Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind 10 nsubj _ _
14 vorbei vorbei ADV PTKVZ _ 10 advmod _ SpaceAfter=No
15 '' '' PUNCT $( _ 10 punct _ SpaceAfter=No
16 . . PUNCT $. _ 3 punct _ _
Yes, the existing training data does not treat it consistently. Just looking at the "dev" set:
- dev-s576, dev-s585, dev-s607, dev-s609, dev-s610, dev-s611, dev-s637 treat the ordinal number as word="21.", lemma="21", pos=ADV.
- dev-s528, dev-s566, dev-s621 treat the ordinal number as word="21.", lemma="21.", pos=ADV.
- dev-s29, dev-s511, dev-s461 treat the ordinal number as two words: word="21", lemma="21", pos=NUM and word=".", lemma=".", pos=PUNCT.
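A quick way to survey how single-token ordinals are annotated is to group them by whether the lemma keeps the dot and by their UPOS tag. The sketch below is not how the list above was compiled; `ordinal_patterns` is a hypothetical helper, it expects real tab-separated CoNLL-U, and it does not detect the third variant where the number and the dot are separate tokens. The two sample rows are taken from the GSD examples quoted earlier in this thread.

```python
import re
from collections import Counter

# A token that consists of digits followed by a single dot, e.g. "21.".
ORDINAL_FORM = re.compile(r"^[0-9]+\.$")


def ordinal_patterns(conllu: str) -> Counter:
    """Count (lemma_keeps_dot, upos) combinations for tokens like '21.'."""
    patterns: Counter = Counter()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # skip anything that is not a regular token row
        form, lemma, upos = columns[1], columns[2], columns[3]
        if ORDINAL_FORM.match(form):
            patterns[(lemma.endswith("."), upos)] += 1
    return patterns


sample = (
    "1\t5.\t5.\tADV\tADV\t_\t2\tdep\t_\tFixTigerDep=Yes\n"
    "8\t21.\t21\tADJ\tADJA\tCase=Dat|Gender=Masc|Number=Sing\t9\tamod\t_\t_\n"
)
print(ordinal_patterns(sample))
```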
I think this should be fixed, otherwise the models will not learn it in a consistent way.
In my opinion, the first variant makes most sense.
Are you the maintainer of the data set?
In looking this over, there seem to be two patterns: days of months are lemmatized without periods and tagged ADJ/ADJA
34 31. 31 ADJ ADJA Case=Dat|Gender=Masc|Number=Sing 35 amod _ _
35 Mai Mai PROPN NN Case=Dat|Gender=Masc|Number=Sing 41 obl _ NamedEntity=Yes
otherwise, it is lemmatized with the period and tagged ADV/ADV
10. Interessierte Länder und Organisationen werden gebeten, bei der Umsetzung dieses Abkommens zu helfen.
2. Soziale Bewegungen leiten ihre Legitimation auch daraus ab, daß Parteien nicht imstande sind, Bürgeranliegen mit der gebotenen Konsequenz zu ihren eigenen zu machen.
One exception appears to be this, lemmatized with period and tagged NUM/ADJA
Platz (0,07 Sekunden vor dem 16. - platzierten Schweden Martin Hansson, also eine recht knappe Entscheidung) 16 Punkte und war Weltcupsieger mit 1268 Punkten vor Raich mit 1255 Punkten.
FWIW, the days of months situation is IMHO very similar to the English case:
Am 27. Mai --> On May 27th
so I would expect them to have the same treatment. Both are dates, and in both cases, the day is an ordinal number meaning "the 27th day of May".
However, in the following English dataset, days of the month seem to be tagged consistently as NOUN (just search the dataset for "1st", "2nd", "3rd" etc.):
https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu
This is similar to the following sentence, where "first" is a noun:
and will be making those decisions after the first of the year.
Shouldn't the German day-of-month ordinal numbers be treated as a noun too?
... maybe? The only things I know about German annotation are what I can extrapolate from reading the annotated dataset, which may be incorrect. You would be better off posting the exact comment you just sent to us as a new issue here:
https://github.com/UniversalDependencies/UD_German-GSD
Done: https://github.com/UniversalDependencies/UD_German-GSD/issues/24