en_core_web_trf (3.8.0) ORG predictions seem inaccurate compared to en_core_web_trf (3.6.1)
en_core_web_trf (3.8.0) labels CARDINAL tokens as ORG. This happens in the affiliation sections of many scientific manuscripts I tried. Interestingly, only the transformer pipeline shows this new, unexpected behavior; NER from en_core_web_lg (3.8.0) works as expected.
How to reproduce the behaviour
from pprint import pprint
import spacy
text = "Kelly E. Williams 1,2,3* , Kathryn P. Huyvaert 2 , Kurt C. Vercauteren 1 , Amy J. Davis 1 , Antoinette J. Piaggio 1\n1 USDA, Wildlife Services, National Wildlife Research Center, Wildlife Genetics Lab, 4101 Laporte Avenue, Fort Collins, CO, USA\n2 Department of Fish, Wildlife, and Conservation Biology, Colorado State University, Fort Collins, CO, 80523, USA\n3 School of Environmental and Forest Sciences, University of Washington, Seattle, WA, USA"
nlp = spacy.load("en_core_web_trf")
doc = nlp(text)
pprint([ent for ent in doc.ents if ent.label_ == "ORG"])
spaCy version 3.6.1 (en_core_web_trf (3.6.1)) returns:
[USDA,
Wildlife Services,
National Wildlife Research Center,
Wildlife Genetics Lab,
Department of Fish, Wildlife, and Conservation Biology,
Colorado State University,
School of Environmental and Forest Sciences,
University of Washington]
spaCy version 3.8.4 (en_core_web_trf (3.8.0)) returns:
[USDA,
Wildlife Services,
National Wildlife Research Center,
Wildlife Genetics Lab,
USA
,
2 Department of Fish, Wildlife,, <=== "2" labeled ORG instead of CARDINAL
Colorado State University,
USA
,
3 School of Environmental and Forest Sciences, <=== "3" labeled ORG instead of CARDINAL
University of Washington]
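To make the difference between the two outputs easier to see, the ORG lists above can be compared mechanically. This is a minimal sketch in plain Python: the two lists are copied from the outputs in this report, and `changed_spans` is a hypothetical helper name, not part of spaCy.

```python
# ORG entity texts copied from the two outputs in this report.
old_orgs = [  # en_core_web_trf (3.6.1)
    "USDA",
    "Wildlife Services",
    "National Wildlife Research Center",
    "Wildlife Genetics Lab",
    "Department of Fish, Wildlife, and Conservation Biology",
    "Colorado State University",
    "School of Environmental and Forest Sciences",
    "University of Washington",
]
new_orgs = [  # en_core_web_trf (3.8.0)
    "USDA",
    "Wildlife Services",
    "National Wildlife Research Center",
    "Wildlife Genetics Lab",
    "USA\n",
    "2 Department of Fish, Wildlife,",
    "Colorado State University",
    "USA\n",
    "3 School of Environmental and Forest Sciences",
    "University of Washington",
]


def changed_spans(old, new):
    """Return (only_in_old, only_in_new), keeping the original order."""
    old_set, new_set = set(old), set(new)
    only_in_old = [s for s in old if s not in new_set]
    only_in_new = [s for s in new if s not in old_set]
    return only_in_old, only_in_new


only_old, only_new = changed_spans(old_orgs, new_orgs)
print("Only in 3.6.1:", only_old)
print("Only in 3.8.0:", only_new)
```

The comparison is on exact strings, so a span that merely gained a leading number shows up on both sides: once in its old form and once in its new form.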
Your Environment
Info about spaCy
- spaCy version: 3.8.4
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Platform: macOS-15.2-arm64-arm-64bit
- Python version: 3.11.11
- Pipelines: en_core_web_trf (3.8.0)
UPD: the same issue occurs with en_core_web_trf (3.7.3):
Info about spaCy
- spaCy version: 3.7.6
- Platform: macOS-15.2-arm64-arm-64bit
- Python version: 3.11.11
- Pipelines: en_core_web_trf (3.7.3)
["'1 USDA'", <-----
"'Wildlife Services'",
"'National Wildlife Research Center'",
"'Wildlife Genetics Lab'",
"'USA\n'",
"'2 Department of Fish, Wildlife,'", <-----
"'Colorado State University'",
"'USA\n'",
"'3 School of Environmental and Forest Sciences'", <-----
"'University of Washington'"]
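Until the model behavior is clarified, one possible stopgap is to clean the ORG strings after extraction. The sketch below is an assumption on my part rather than a spaCy feature: `clean_org_text` is a hypothetical helper that works on plain strings (for example `ent.text`), stripping a leading affiliation number, surrounding whitespace, and a trailing comma.

```python
import re

# A run of digits followed by whitespace at the very start of the span,
# e.g. the "2 " in "2 Department of Fish, Wildlife,".
_LEADING_NUMBER = re.compile(r"^\d+\s+")


def clean_org_text(span_text):
    """Strip a leading affiliation number, outer whitespace, and a trailing comma."""
    text = span_text.strip()
    text = _LEADING_NUMBER.sub("", text)
    return text.rstrip(",").strip()


# Strings taken from the outputs in this report.
for raw in ["1 USDA", "2 Department of Fish, Wildlife,", "USA\n"]:
    print(repr(raw), "->", repr(clean_org_text(raw)))
```

This only tidies the text. It does not restore the truncated span ("and Conservation Biology" is still missing), and it does not change the ORG label on "USA", so it is not a substitute for a fix in the pipeline.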