
German ordinal numbers leading to split sentences

Open yolpsoftware opened this issue 4 years ago • 18 comments

Describe the bug

In German, ordinal numbers have a dot after the number:

  • On the 1st day --> Am 1. Tag
  • The 23rd item --> Der 23. Eintrag

Furthermore, the day-of-the-month part of a date is always an ordinal number:

  • September 30 --> Der 30. September (i.e. "the thirtieth September")

It seems that, under some circumstances, the default German model of Stanza is confused by the dot after ordinal numbers:

Am 20. März 1905 wurde der FC Chelsea gegründet.

This sentence leads to the word lists [['am', '20', '.'], ['März', '1905', 'wurde', 'der', 'FC', 'Chelsea', 'gegründet']], i.e. Stanza detects two sentences, neither of which makes sense on its own.

Am 21. Juni beginnt der Sommer.

This sentence (sometimes!) leads to the word lists [['Am', '21', '.'], ['Juni', 'beginnt', 'der', 'Sommer']]. Sometimes this returns the correct result (just one sentence), but sometimes not. I am not sure whether the Stanza model is deterministic or whether some random factor is involved.

To Reproduce

Steps to reproduce the behavior:

import stanza
stanza.download('de')
nlp = stanza.Pipeline('de')
nlp('Am 20. März 1905 wurde der FC Chelsea gegründet.')
nlp('Am 21. Juni beginnt der Sommer.')
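To make the splits visible, the calls above can be extended slightly (a sketch using the Document/Sentence/Word accessors from the Stanza API):

import stanza

# Sketch: print the number of detected sentences and the tokenized words
# per sentence, so the unwanted split after the ordinal dot shows up.
nlp = stanza.Pipeline('de')
for text in ['Am 20. März 1905 wurde der FC Chelsea gegründet.',
             'Am 21. Juni beginnt der Sommer.']:
    doc = nlp(text)
    print(len(doc.sentences),
          [[word.text for word in sentence.words] for sentence in doc.sentences])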

Expected behavior

Since "Am 20. März 1905 wurde der FC Chelsea gegründet." is a single, well-formed German sentence, we expect Stanza to return the POS tags for one sentence instead of two.

Environment (please complete the following information):

  • OS: Ubuntu 20.04.1 under Windows 10 (WSL).
  • Python version: 3.8.5 under virtualenv
  • Stanza version: 1.2

yolpsoftware avatar Feb 08 '21 12:02 yolpsoftware

Would it be useful to add multi-word tokenization here? There could be a way to ensure that the number and the dot stay together as one token.

anjali-rgpt avatar Aug 22 '21 09:08 anjali-rgpt

MWT is an interesting idea, but it's quite different from the existing annotation standard in GSD, which keeps the number and the dot as one token for dates and splits them into two tokens for lists.

Overall I would say the problem is that "." followed by a space marks the end of a sentence a huge portion of the time in the training data. Characters alone seem to be insufficient for reliably distinguishing periods that can continue a sentence from periods that can't. One possibility would be a model that looks at more of the surrounding word context, not just the characters.

AngledLuffa avatar Aug 23 '21 00:08 AngledLuffa

There are other scenarios where a dot followed by a space is not the end of a sentence, e.g. abbreviations ("U.S. Marshal", "at 4 p.m. or later"). If Stanza is able to parse those, then it seems there are enough of them in the training data.

Why not just add some German sentences with ordinal numbers to the training data?

yolpsoftware avatar Aug 23 '21 07:08 yolpsoftware

Why not just add some German sentences with ordinal numbers to the training data?

None of us speak German

AngledLuffa avatar Aug 23 '21 17:08 AngledLuffa

I can gladly provide some examples. How many do you need, approximately?

I assume you already have some German sentences, otherwise you could not train a German model. So, you only need some additional examples with ordinal numbers, right?

yolpsoftware avatar Aug 23 '21 18:08 yolpsoftware

Yeah, the current default German dataset is here:

https://github.com/UniversalDependencies/UD_German-GSD

There are 23 matches of the regex [0-9][.] which don't end a sentence, and a few hundred which do, so you can see why the model frequently makes the wrong choice.

I can't really pretend to know how many examples would be enough, but it's going to be more than a handful and less than War & Peace. Probably one for each numeral and one for each day / month is a good start, although we might actually need the product rather than the sum of those numbers. Autogenerating sentences is probably a bad idea, as we might accidentally introduce biases where the model only handles the new splits correctly in the exact same context as the autogenerated sentences.

The new sentences don't need a full pos/dep/etc treatment; we can train the tokenizer separately from the rest. Anyway, the point is I'm happy to rebuild models and report the results a few times as you come up with new text.
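For reference, a rough way to reproduce that count (just a sketch; the training file name is an assumption, and the regex also picks up things like thousands separators):

import re

# Sketch: count digit-plus-period matches in the GSD sentence texts, split by
# whether the match ends the sentence or the sentence continues after it.
continuing, sentence_final = 0, 0
with open('de_gsd-ud-train.conllu', encoding='utf-8') as conllu:
    for line in conllu:
        if line.startswith('# text = '):
            text = line[len('# text = '):].strip()
            for match in re.finditer(r'[0-9]\.', text):
                if match.end() == len(text):
                    sentence_final += 1
                else:
                    continuing += 1
print(continuing, 'continue the sentence;', sentence_final, 'end it')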

AngledLuffa avatar Aug 23 '21 21:08 AngledLuffa

When you say it does not need the full "pos/dep/etc" treatment, what does that mean exactly? How much of the following data do I need to supply?

[screenshot of a fully annotated CoNLL-U example for the sentence "Manasse ist ein einzigartiger Parfümeur."]

yolpsoftware avatar Nov 14 '21 22:11 yolpsoftware

None of it. All we need is the tokenization and any possible MWT.

AngledLuffa avatar Nov 14 '21 22:11 AngledLuffa

Just to be sure, for the example above, "Manasse ist ein einzigartiger Parfümeur.", you just need

Manasse
ist
ein
einzigartiger
Parfümeur
.

For Multi-Word Tokens (i.e. several words packed into one), you just need an expansion, i.e. the individual words?

yolpsoftware avatar Nov 15 '21 08:11 yolpsoftware

correct

AngledLuffa avatar Nov 15 '21 15:11 AngledLuffa

And, the correct tokenization for an ordinal number is to leave the dot next to the number, right?

Am
23.
Oktober
habe
ich
Geburtstag
.

Is 23. an MWT, because it is composed of the number and the dot? Or should it be treated as a single token?

yolpsoftware avatar Nov 15 '21 19:11 yolpsoftware

Here are my results so far. Tokenization is mostly trivial (split on spaces and commas, don't split off the dots).

In what format should I submit them?

Do we need much more?

Please note that I'm a native German speaker, but I cannot guarantee that the spelling and punctuation are 100% correct.

Am 23. Oktober habe ich Geburtstag.
Der 2. Weltkrieg begann am 1. September 1939.
Der 1. Mai ist der Tag der Arbeit.
Der 29. Februar existiert nur in einem Schaltjahr.
Am 2. Tag begannen sie mit der Arbeit.
Gott ruhte am 7. Tag.
Am 20. März 1905 wurde der FC Chelsea gegründet.
Am 21. Juni beginnt der Sommer.
Beim 12. Versuch waren wir zum 1. Mal erfolgreich.
Heiligabend ist am 24., Weihnachten am 25. und Stephanstag am 26. Dezember.
Kommst du am 14. März oder am 15.?
Kommst du am 14. oder am 15. März?
Am 6. Dezember ist St. Nikolaustag.
An einem Freitag den 13. würde ich im Lift nicht in den 13. Stock fahren.
Der 31. Oktober ist bekannt als Halloween.
Den 32. gibt es in keinem Monat.
Jeder 7. Tag ist ein Mittwoch.
Die Titanic startete am 10. April 1912 und am 15. sank sie schon.
Der 1. Januar ist der Anfang des Jahres.
Der 1. Offizier ist direkt dem Kapitän unterstellt.
Die 9. Symphonie Beethoven enthält die Melodie der Europahymne.
Den 77. Geburtstag feierten wir zusammen mit dem 50. Hochzeitstag.
Der 44. Präsident der USA hiess Barack Obama.
Am 5. jedes Monats machen wir Grosseinkauf.
Das 7. Weltwunder war der Tempel von Artemis.
Der 16. Jänner gilt als der kälteste Tag des Jahres.
Nach der 8. Tasse Kaffee habe ich genug.
Bis zum 17. Juli haben wir Schule, dann beginnen unsere Ferien.
Er versuchte es noch ein 18. Mal, bevor er aufgab.
Der 2. Tag der Woche ist Dienstag.
Der 19. September ist bei uns ein Feiertag.
Der 22. Februar im 3. Winter war der letzte kalte Tag.
Reinhold Messner war der 1. Mensch, der den Mount Everest ohne Sauerstoff bestieg.
Ich lese das Buch zum 4. Mal.
Um den 10. August herum sieht man viele Sternschnuppen.
Im November ist es oft kalt und nass, aber nach dem 25. hatten wir schönes Wetter.
Im 2. Semester wird es dann schwieriger.
Heute ist der 3. Juli.
Bis zum 11. Kapitel hielt ich durch.
Kaiser Karl V. herrschte über Spanien, Italien und das Deutsche Reich.
Franz II. war der letzte römisch-deutsche Kaiser.
Franz I. war zuerst Herzog von Lothringen, bevor er Kaiser des Heiligen Römischen Reiches wurde.
König Karl IV. von Ungarn wurde am 30. Juli 1918 zum letzten Kaiser von Österreich gekrönt.
Rudolf III. nannte man "den Schweigsamen".
Albrecht VII. von Habsburg war ein wichtiger Mäzen der Kunst seiner Zeit.

yolpsoftware avatar Nov 15 '21 20:11 yolpsoftware

I think you mostly need to follow along with the GSD standard by looking at the dataset:

https://github.com/UniversalDependencies/UD_German-GSD

For example, consider the following two entries. I'm not sure if the difference in lemma is a glitch or intentional usage based on the context.

1       5.      5.      ADV     ADV     _       2       dep     _       FixTigerDep=Yes
2       Ich     ich     PRON    PPER    Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       _
3       bedaure bedauern        VERB    VVFIN   Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0       root    _       _
4       das     der     DET     ART     Case=Acc|Definite=Def|Gender=Neut|Number=Sing|PronType=Art      6       det     _       _
5       Karlsruher      Karlsruher      ADJ     ADJA    Case=Acc|Gender=Neut|Number=Sing        6       amod    _       _
6       Urteil  Urteil  NOUN    NN      Case=Acc|Gender=Neut|Number=Sing        3       obj     _       SpaceAfter=No
7       .       .       PUNCT   $.      _       3       punct   _       _
1       Der     der     DET     ART     Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      2       det     _       _
2       Aschermittwoch  Aschermittwoch  PROPN   NN      Case=Nom|Gender=Masc|Number=Sing        3       nsubj   _       NamedEntity=Yes
3       kommt   kommen  VERB    VVFIN   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       _
4       bestimmt        bestimmt        ADJ     ADJD    _       3       xcomp   _       _
5       --      --      PUNCT   $(      _       10      punct   _       _
6-7     am      _       _       _       _       _       _       _       _
6       an      an      ADP     APPR    _       9       case    _       _
7       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Neut|Number=Sing|PronType=Art      9       det     _       _
8       21.     21      ADJ     ADJA    Case=Dat|Gender=Masc|Number=Sing        9       amod    _       _
9       Februar Februar PROPN   NN      Case=Dat|Gender=Masc|Number=Sing        10      obl     _       NamedEntity=Yes
10      ist     sein    VERB    VAFIN   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   3       parataxis       _       _
11      wieder  wieder  ADV     ADV     _       10      advmod  _       _
12      ``      ``      PUNCT   $(      _       13      punct   _       SpaceAfter=No
13      alles   alle    PRON    PIS     Case=Nom|Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind      10      nsubj   _       _
14      vorbei  vorbei  ADV     PTKVZ   _       10      advmod  _       SpaceAfter=No
15      ''      ''      PUNCT   $(      _       10      punct   _       SpaceAfter=No
16      .       .       PUNCT   $.      _       3       punct   _       _
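Along those lines, a tokenization-only entry for one of the proposed sentences might look roughly like this (a sketch only: the "Am" MWT expansion mirrors the "am" lines above, the ordinal is kept as a single token as in the "21." line above, and all other columns are left blank since only tokenization and MWT are needed):

# text = Am 23. Oktober habe ich Geburtstag.
1-2     Am      _       _       _       _       _       _       _       _
1       an      _       _       _       _       _       _       _       _
2       dem     _       _       _       _       _       _       _       _
3       23.     _       _       _       _       _       _       _       _
4       Oktober _       _       _       _       _       _       _       _
5       habe    _       _       _       _       _       _       _       _
6       ich     _       _       _       _       _       _       _       _
7       Geburtstag      _       _       _       _       _       _       _       SpaceAfter=No
8       .       _       _       _       _       _       _       _       _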

AngledLuffa avatar Nov 18 '21 01:11 AngledLuffa

Yes, the existing training data does not treat these ordinals consistently. Just looking at the "dev" set:

  • dev-s576, dev-s585, dev-s607, dev-s609, dev-s610, dev-s611, dev-s637, treat the ordinal number as word="21.", lemma="21", pos=ADV.
  • dev-s528, dev-s566, dev-s621 treat the ordinal number as word="21.", lemma="21.", pos=ADV.
  • dev-s29, dev-s511, dev-s461 treat the ordinal number as two words: word="21", lemma="21", pos=NUM and word=".", lemma=".", pos=PUNCT

I think this should be fixed, otherwise the models will not learn it in a consistent way.

In my opinion, the first variant makes the most sense.
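For reference, the single-token cases can be listed with a quick scan like the following (a sketch: it assumes the standard ten-column CoNLL-U layout and the dev file name from the repository; the cases split into NUM + PUNCT need a separate check):

import re

# Sketch: print every token of the form "<digits>." in the dev file together
# with its sentence id, lemma and UPOS, to expose the inconsistent treatments.
with open('de_gsd-ud-dev.conllu', encoding='utf-8') as conllu:
    sent_id = None
    for line in conllu:
        if line.startswith('# sent_id'):
            sent_id = line.split('=', 1)[1].strip()
            continue
        columns = line.rstrip('\n').split('\t')
        if len(columns) == 10 and re.fullmatch(r'[0-9]+\.', columns[1]):
            print(sent_id, columns[1], columns[2], columns[3])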

Are you the maintainer of the data set?

yolpsoftware avatar Nov 18 '21 08:11 yolpsoftware

Looking this over, there seem to be two patterns. Days of the month are lemmatized without the period and tagged ADJ/ADJA:

34      31.     31      ADJ     ADJA    Case=Dat|Gender=Masc|Number=Sing        35      amod    _       _
35      Mai     Mai     PROPN   NN      Case=Dat|Gender=Masc|Number=Sing        41      obl     _       NamedEntity=Yes

Otherwise, the ordinal is lemmatized with the period and tagged ADV/ADV:

10. Interessierte Länder und Organisationen werden gebeten, bei der Umsetzung dieses Abkommens zu helfen.
 2. Soziale Bewegungen leiten ihre Legitimation auch daraus ab, daß Parteien nicht imstande sind, Bürgeranliegen mit der gebotenen Konsequenz zu ihren eigenen zu machen.

One exception appears to be this one, lemmatized with the period and tagged NUM/ADJA:

Platz (0,07 Sekunden vor dem 16. - platzierten Schweden Martin Hansson, also eine recht knappe Entscheidung) 16 Punkte und war Weltcupsieger mit 1268 Punkten vor Raich mit 1255 Punkten.

AngledLuffa avatar Nov 25 '21 00:11 AngledLuffa

FWIW, the day-of-the-month situation is IMHO very similar to the English case:

Am 27. Mai --> On May 27th

so I would expect them to have the same treatment. Both are dates, and in both cases, the day is an ordinal number meaning "the 27th day of May".

However, in the following English dataset, days of the month seem to be tagged consistently as NOUN (just search the dataset for "1st", "2nd", "3rd" etc.):

https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu
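(A quick way to check — just a sketch that downloads the file above and tallies the UPOS column for tokens like "1st" or "22nd":)

import re
import urllib.request
from collections import Counter

# Sketch: count the UPOS tags assigned to English ordinal tokens such as
# "1st", "2nd", "23rd" in the EWT dev file linked above.
url = ('https://raw.githubusercontent.com/UniversalDependencies/'
       'UD_English-EWT/master/en_ewt-ud-dev.conllu')
tags = Counter()
with urllib.request.urlopen(url) as response:
    for raw_line in response:
        columns = raw_line.decode('utf-8').rstrip('\n').split('\t')
        if len(columns) == 10 and re.fullmatch(r'[0-9]+(st|nd|rd|th)', columns[1], re.IGNORECASE):
            tags[columns[3]] += 1  # column 4 (index 3) is UPOS
print(tags)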

This is similar to the following sentence, where "first" is a noun:

and will be making those decisions after the first of the year.

Shouldn't the German day-of-month ordinal numbers be treated as a noun too?

yolpsoftware avatar Nov 25 '21 08:11 yolpsoftware

... maybe? The only things I know about German annotation are what I can extrapolate from reading the annotated dataset, which may be incorrect. You would be better off posting the exact comment you just sent to us as a new issue here:

https://github.com/UniversalDependencies/UD_German-GSD

AngledLuffa avatar Nov 25 '21 08:11 AngledLuffa

Done: https://github.com/UniversalDependencies/UD_German-GSD/issues/24

yolpsoftware avatar Nov 25 '21 08:11 yolpsoftware