NER performance in German pipeline 3.4.0

Open nouman44 opened this issue 3 years ago • 1 comment

NER performance has decreased in the new German pipeline de_core_news_lg 3.4.0 compared with the older de_core_news_lg 2.3.0. The new version fails to detect simple LOC entities that the older version (2.3.0) found. Is there a specific reason for this, or did the training data itself change?

Below are some samples. OLD: gives the LOC entities from the German pipeline de_core_news_lg 2.3.0, NEW: gives the entities from the German pipeline de_core_news_lg 3.4.0, and TEXT: is the text passed to the pipeline. The format is: ent.text (ent.start_char, ent.end_char) ent.label_. None means that no LOC entity was detected.

  • TEXT: Hamburg, 18. Mai 20182. OLD: Hamburg (0, 7) LOC NEW: None

  • TEXT: Mannheim, 20. November 2019 OLD: Mannheim (0, 8) LOC NEW: None

  • TEXT: Aufsichtsrat: Bankkaufmann Michael Maletz, Isernhagen Vorsitzender –. Kaufmann Heinz-Jürgen Prahl, Hamburg . Kaufmann Thomas Prahl, Hamburg. . Die Bezüge des Aufsichtsrates betrugen für 2019 EUR 2.250,00.. . Vorstand: Kaufmann Harald Leichnitz, Laatzen. Kaufmann Ralph Zimmermann, Wietze. . Im Geschäftsjahr 2019 wurden durchschnittlich beschäftigt:. OLD: Hamburg (99, 106) LOC, Hamburg (132, 139) LOC NEW: Hamburg (99, 106) LOC

  • TEXT: EIG Energy Fund XVI (Scotland), L.P., Washington, DC/USA OLD: Scotland (21, 29) ORG, L.P. (32, 36) ORG, Washington (38, 48) LOC, DC (50, 52) LOC, USA (53, 56) LOC NEW: Scotland (21, 29) ORG, DC/USA (50, 56) ORG

  • TEXT: Bratislava, Slowakei OLD: Bratislava (0, 10) LOC, Slowakei (12, 20) LOC NEW: None

  • TEXT: Bei der 1995 erworbenen Beteiligung handelt es sich um die Bundeskreditgarantiegemeinschaft des Handwerks GmbH, Berlin. OLD: Berlin (112, 118) LOC NEW: Berlin (112, 118) ORG

  • TEXT: Berlin, 9. März 2020 OLD: Berlin (0, 6) LOC NEW: None

  • TEXT: Oslo, Norwegen OLD: Oslo (0, 4) LOC, Norwegen (6, 14) LOC NEW: None

  • TEXT: Montreal, Kanada OLD: Montreal (0, 8) LOC, Kanada (10, 16) LOC NEW: None

  • TEXT: Im Jahr 2019 war weltweit eine weitere Wachstumsabschwächung zu beobachten. Vor allem in Europa, und dabei besonders in Deutschland, waren, wie im Abschnitt “Gesamtwirtschaftliche Rahmenbedingungen in Deutschland” dargestellt, die Auswirkungen des Handelskonflikts zwischen den USA und China in einem deutlich zurückgehenden Wachstum des Bruttoinlandsprodukts sichtbar. OLD: Europa (89, 95) LOC, Deutschland (120, 131) LOC, Deutschland (201, 212) LOC, USA (278, 281) LOC, China (286, 291) LOC NEW: Europa (89, 95) LOC, Deutschland (120, 131) LOC, China (286, 291) LOC
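To reproduce listings in the `ent.text (ent.start_char, ent.end_char) ent.label_` format, something like the following sketch might be used. The helper name `format_ents` and its `label` filter are hypothetical and not part of spaCy. The function relies only on the `ents`, `text`, `start_char`, `end_char` and `label_` attributes that spaCy `Doc` and `Span` objects expose.

```python
def format_ents(doc, label=None):
    """Render a doc's entities as 'text (start_char, end_char) label'.

    `doc` is any object exposing `.ents`, whose items have `text`,
    `start_char`, `end_char` and `label_` (as spaCy Span objects do).
    If `label` is given, only entities with that label are kept.
    Returns the string "None" when no entity matches.
    """
    ents = [e for e in doc.ents if label is None or e.label_ == label]
    if not ents:
        return "None"
    return ", ".join(
        f"{e.text} ({e.start_char}, {e.end_char}) {e.label_}" for e in ents
    )


# Usage, assuming spaCy and the model package are installed:
#   import spacy
#   nlp = spacy.load("de_core_news_lg")
#   print(format_ents(nlp("Bratislava, Slowakei")))
```

Passing `label="LOC"` would restrict the output to LOC entities. The default prints every label, which matches samples that also list ORG entities.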

Environment

  • Operating System: macOS 12.5.1
  • Python Version Used: 3.9
  • spaCy Version Used: 3.4.1

nouman44 • Aug 30 '22 12:08

Thanks for the report, it's always interesting to see examples like this, even though in many cases there isn't any immediate action that we can take (see #3052). Some thoughts:

In general, the overall NER performance of the v3 trained pipelines is very similar to v2, with minor drops in some cases, on the order of 0.01 F-score for the published pipelines. I think you could also find similar examples where entities are identified in v3 and not in v2.
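One way to check whether differences run in both directions is to diff the entity lists from the two versions for each text. The sketch below is a hypothetical helper (`diff_ents` is not a spaCy function). It assumes the entities have already been exported as `(text, start_char, end_char, label)` tuples, since a 2.3.0 model and a 3.4.0 model require different spaCy versions and cannot be loaded in the same environment.

```python
def diff_ents(old, new):
    """Compare two entity lists of (text, start_char, end_char, label) tuples.

    Spans are matched on exact (start_char, end_char) only. Returns a dict:
      lost      - entities whose span appears only in `old`
      gained    - entities whose span appears only in `new`
      relabeled - (text, start, end, old_label, new_label) for spans found
                  in both lists with different labels
    """
    old_by_span = {(s, e): (t, lab) for t, s, e, lab in old}
    new_by_span = {(s, e): (t, lab) for t, s, e, lab in new}

    lost = [
        (t, s, e, lab)
        for (s, e), (t, lab) in old_by_span.items()
        if (s, e) not in new_by_span
    ]
    gained = [
        (t, s, e, lab)
        for (s, e), (t, lab) in new_by_span.items()
        if (s, e) not in old_by_span
    ]
    relabeled = [
        (t, s, e, lab, new_by_span[(s, e)][1])
        for (s, e), (t, lab) in old_by_span.items()
        if (s, e) in new_by_span and new_by_span[(s, e)][1] != lab
    ]
    return {"lost": lost, "gained": gained, "relabeled": relabeled}


# Entities reported in this issue for the "EIG Energy Fund XVI" sample.
old = [("Scotland", 21, 29, "ORG"), ("L.P.", 32, 36, "ORG"),
       ("Washington", 38, 48, "LOC"), ("DC", 50, 52, "LOC"),
       ("USA", 53, 56, "LOC")]
new = [("Scotland", 21, 29, "ORG"), ("DC/USA", 50, 56, "ORG")]
result = diff_ents(old, new)
```

Because matching is on exact spans, a changed boundary such as `DC` (50, 52) versus `DC/USA` (50, 56) counts as one lost and one gained entity rather than a relabeling. Summing the `gained` lists over a larger sample would show how often v3 finds entities that v2 missed.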

From similar issues for v3, I can say that users have noticed cases where doc-initial entities, especially short ones like first names, seem worse in v3 than in v2, but we haven't found any obvious cause, since there weren't any major changes to the NER algorithm in v3.

We've been experimenting with lowercase augmentation (starting in v2.3) and whitespace augmentation (in v3.4+), which may be interacting with this.

In general, short texts without context are going to have lower performance, and for the pipelines trained on WikiNER, so are texts that don't look like Wikipedia genre-wise.
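A rough way to probe whether the doc-initial position or the missing context is responsible would be to run the same text with some leading context and map the predicted offsets back to the original string. The sketch below is only an illustration: `unshift_ents` is a hypothetical helper, and the prefix `"Ort und Datum: "` is an arbitrary choice, not something suggested in this thread.

```python
def unshift_ents(ents, prefix):
    """Map entities predicted on `prefix + text` back to offsets in `text`.

    `ents` are (text, start_char, end_char, label) tuples predicted on the
    prefixed string. Entities that start inside the prefix are dropped.
    """
    shift = len(prefix)
    return [
        (t, s - shift, e - shift, lab)
        for t, s, e, lab in ents
        if s >= shift
    ]


# Hypothetical probe: prepend context so "Berlin" is no longer doc-initial.
prefix = "Ort und Datum: "
text = "Berlin, 9. März 2020"
probe_text = prefix + text

# Usage, assuming spaCy and the model package are installed:
#   import spacy
#   nlp = spacy.load("de_core_news_lg")
#   doc = nlp(probe_text)
#   ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
#   print(unshift_ents(ents, prefix))
```

If an entity appears only in the prefixed variant, that would point toward the doc-initial position or the lack of context rather than the entity string itself. A single prefix is weak evidence, so several different prefixes would be needed before drawing conclusions.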

adrianeboyd • Sep 05 '22 12:09

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] • Feb 06 '23 14:02

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] • Mar 09 '23 00:03