
📚 Inaccurate pre-trained model predictions master thread

Open ines opened this issue 5 years ago • 138 comments

This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.

Why a master thread instead of separate issues?

GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.

Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to act on. Sometimes, the mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.

Other times, it's just something we have to accept. A model that's 90% accurate will make a mistake on every 10th prediction. A model that's 99% accurate will be wrong once every 100 predictions.
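Put concretely, the error rates above follow from a trivial back-of-the-envelope calculation (the helper below is just an illustration, not anything from spaCy):

```python
def expected_errors(accuracy, n_tokens):
    """Expected number of incorrect predictions for a given accuracy."""
    return round((1 - accuracy) * n_tokens)

print(expected_errors(0.90, 1000))  # 100 mistakes per 1000 predictions
print(expected_errors(0.99, 1000))  # 10 mistakes per 1000 predictions
```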

The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.

For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.

Reporting incorrect predictions in this thread

If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)

You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.

ines avatar Dec 14 '18 11:12 ines

Doing an annotation project for the pre-trained models with Prodigy looks like a really good idea! Do you have any idea when it could happen and who will be able to participate?

mauryaland avatar Dec 17 '18 19:12 mauryaland

From #3070: English models predict empty strings as tags (confirmed also in nightly).

>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("I like  London and Berlin")
>>> [(t.tag_, t.pos_) for t in doc]
[('PRP', 'PRON'), ('VBP', 'VERB'), ('', 'SPACE'), ('NNP', 'PROPN'), ('CC', 'CCONJ'), ('NNP', 'PROPN')]

From #2313: Similar problem in French (confirmed also in nightly).

>>> nlp = spacy.load("fr_core_news_sm")
>>> doc = nlp("Nous a-t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', '', 'PART', 'PRON', 'VERB', 'PRON', 'PUNCT']
['PRON', '', 'ADV', 'PRON', 'VERB', 'PRON', 'PUNCT']
>>> doc = nlp("Nous a t-elle précisé ceci?")
>>> [x.pos_ for x in doc] # ['PRON', 'AUX', 'NOUN', 'VERB', 'PRON', 'PUNCT']
['PRON', 'AUX', 'VERB', 'VERB', 'PRON', 'PUNCT']

ines avatar Dec 20 '18 11:12 ines

@mauryaland I hope we can have annotations starting in January. The first data to be annotated will be English and German, with other annotation projects hopefully starting fairly quickly.

We'll probably be hiring annotators to do most of the work. We might do a little bit of crowd-sourcing as a test, but we generally believe annotation projects run better with fewer annotators. What we would benefit from is having one person per treebank overseeing the work, communicating with the annotators, and making language-specific annotation policy decisions.

honnibal avatar Dec 20 '18 11:12 honnibal

I am trying to upgrade from 2.0.x to 2.1 but I'm seeing different results for the small English model. It's not meaningful to draw conclusions case by case, but I see accuracy has decreased for some POS and dependency tags.

In general, verbs become nouns, and dependencies are lost or changed. Especially this one: (2.0) tight/VB [advmod] tight/RB → (2.1) tight/VB [acomp] tight/RB. To be an acomp, "tight" should be an adjective, I guess.

Should I assume the new models (2.1.0a5) will change when 2.1 is released, or should we not expect changes?

POS Problems

VB > NN doubles/VB > NNS
From the restaurant, the Seventh's boundary doubles back east along the Pennsylvania Turnpike.

hooking/VB > NN
No hooking up with college kids.

Dependency changes

sit tight
tight/VB [advmod] tight/RB > advmod > (2.1) acomp (adjectival complement but tight is adverb)
As for Russia's sovereign debt, most investors are sitting tight, believing Washington will not bar investors from it, even if the U.S.

speed [compound] skating > no dep (2.1)
Yes, but all the Dutch medals are in speed skating only.

cross [acl] examined > no dep
Mr Goodwin is due to be cross examined on 8 June, the day of the general election.

test [dep] fly > dependency reversed
They hope to test-fly their craft at Clow International Airport.

cross [npadvmod] examines
(2.1) dependencies connected over punctuation "-"
The witness was cross-examined by the defense.

mehmetilker avatar Jan 12 '19 13:01 mehmetilker

Hi guys!

Just a quick question regarding the missing tags issue mentioned above (from #2313: similar problem in French, confirmed also in nightly): does this come from the models? Are you working on this? In case it helps, I am adding examples with missing tags:

  • Qu'est-ce qui va augmenter ? (current tags: ['', '', 'PRON', 'PRON', 'VERB', 'VERB', 'PUNCT']), missing for "Qu'" and "est"
  • Est-ce qu'il y a un poème ce matin ? (current tags: ['', 'PRON', 'SCONJ', 'PRON', 'PRON', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN', 'PUNCT']), missing for "Est".
  • Laquelle a-t-elle été ? (current tags: ['PROPN', '', 'PART', 'PRON', 'AUX', 'PUNCT']), missing for "a".

Thank you!

amperinet avatar Mar 27 '19 10:03 amperinet

I'm not sure whether this belongs here or in its own issue, but I noticed that the tagger in spacy 2.1 en_core_web_md (2.1.0) seems to have some major problems.

I ran a quick evaluation on the PTB tags in UD_English-PUD with the following results (without normalizing punctuation tags, so the actual results would be a bit higher):

Model Tagging Acc.
------------------
sm       0.945
md       0.792
lg       0.952

The performance is similar for UD tags and for other corpora. With spacy 2.0, the results for all three models are similar.

I suspect these problems are what led to this hacky modification to a model test case, which now doesn't catch the error it's supposed to catch:

https://github.com/explosion/spacy-models/commit/b516a3bd066f8dc483e69a7aa99a26ea9566d687#diff-09cdc890bfe36b8c3ac094953ad251bd

Below are simplified confusion matrices for the more frequent non-punctuation tags for md vs lg, where you can see that something has gone wrong in the md model (sm looks similar to lg). I was hoping to see a clear pattern that explained the errors (like two consistently swapped tags), but it's so all over the place that my first guess would be that there was an offset error for some portion of the training data.

spacy21_en_core_web_md spacy21_en_core_web_lg
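As a rough illustration of the kind of evaluation described above (not the actual evaluation script; the tag data below is invented), token-level tagging accuracy and a confusion matrix for aligned gold/predicted tag lists can be computed in a few lines:

```python
from collections import Counter

def tag_accuracy(gold_tags, pred_tags):
    """Token-level tagging accuracy for aligned gold/predicted tag lists."""
    assert len(gold_tags) == len(pred_tags)
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)

def confusion_counts(gold_tags, pred_tags):
    """Counter mapping (gold, predicted) tag pairs to their frequencies."""
    return Counter(zip(gold_tags, pred_tags))

gold = ["NN", "VB", "NN", "JJ", "NN"]
pred = ["NN", "NN", "NN", "JJ", "VB"]
print(tag_accuracy(gold, pred))                    # 0.6
print(confusion_counts(gold, pred)[("VB", "NN")])  # 1
```

Inspecting the off-diagonal entries of the confusion counts is what reveals whether errors follow a pattern (two swapped tags) or are scattered, as reported for the md model.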

adrianeboyd avatar Apr 08 '19 07:04 adrianeboyd

The model (at least "en_core_web_sm") fails in prediction whenever capitalization is not used correctly. For example, compare the predictions for "j.k. rowling wishes snape happy birthday in the most magical way" and "J.K. Rowling Wishes Snape Happy Birthday In The Most Magical Way": https://puu.sh/DiNTh/d3b940ef65.png

The first has "rowling" tagged as a verb, despite "wishes" being the verb. The second tends too easily to assign NNP and "ROOT". The best version would be "J.K. Rowling wishes Snape happy birthday in the most magical way", which still gives "Snape" the entity type GPE.

This kind of error is constant whenever a supposedly capitalized name (e.g. "United States") isn't capitalized, or a supposedly non-capitalized word is capitalized. This causes problems when applying the model to, for example, headlines (which capitalize the first letter of every word).
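A crude mitigation (purely illustrative, not part of spaCy) is to detect headline-style casing before parsing, so such text can be lowercased or routed through a truecaser first:

```python
def looks_like_headline(text):
    """Heuristic: most multi-letter words start with an uppercase letter."""
    words = [w for w in text.split() if len(w) > 1 and w[0].isalpha()]
    if not words:
        return False
    capped = sum(w[0].isupper() for w in words)
    return capped / len(words) > 0.8

print(looks_like_headline("J.K. Rowling Wishes Snape Happy Birthday In The Most Magical Way"))  # True
print(looks_like_headline("the quick brown fox jumps over the lazy dog"))  # False
```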

alessio-greco avatar Apr 23 '19 17:04 alessio-greco

In [6]: nlp('acetaminophen')[0].tag_, nlp('acetaminophen')[0].pos_
Out[6]: ('UH', 'INTJ')

adam-ra avatar May 06 '19 08:05 adam-ra

Hi, the lemma for "multiplies" should be "multiply", right?

(Pdb) tmp = nlp(u"A rabbit multiplies rapidly by having lots of sex.")
(Pdb) tmp
A rabbit multiplies rapidly by having lots of sex.
(Pdb) [token.lemma_ for token in tmp]
[u'a', u'rabbit', u'multiplie', u'rapidly', u'by', u'have', u'lot', u'of', u'sex', u'.']

dlemke01 avatar May 25 '19 19:05 dlemke01

Sentence tokenisation issue? I thought the nonstandard lexis might be the cause, but normalising it still gives pretty unusual sentence tokenisation:

>>> import spacy                                                
>>> nlp = spacy.load('en')
>>> s = 'Me and you are gonna have a talk. \nSez who? \nSez me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1):
...    print(sent_index, sent.text) 
                                                                                                                                                                 
1 Me and you are gonna have a talk. 
2 Sez
3 who? 
Sez me. 
4 Hey!
5 What did I say?

>>> s = 'Me and you are gonna have a talk. \nSays who? \nSays me. \nHey! What did I say?'
>>> doc = nlp(s)
>>> for sent_index, sent in enumerate(doc.sents, start=1): 
...     print(sent_index, sent.text) 
                                                                     
1 Me and you are gonna have a talk. 

2 Says who? 
Says me. 

3 Hey!
4 What did I say?

interrogator avatar Jun 14 '19 23:06 interrogator

Another incorrect lemma_

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("the Greys")
[token.lemma_ for token in doc]
# ['the', 'Greys']

"Greys" should be "Grey"

ctrngk avatar Jun 16 '19 04:06 ctrngk

another lemma_ conflict

import spacy
nlp = spacy.load('en_core_web_lg')
[token.lemma_ for token in nlp("to be flattered by sth")]
# ['to', 'be', 'flatter', 'by', 'sth'] correct
[token.lemma_ for token in nlp("to feel flattered that")]
# ['to', 'feel', 'flattered', 'that'] error

The second "flattered" should be "flatter"

ctrngk avatar Jun 16 '19 10:06 ctrngk

I've noticed the tokenization and entity recognition around compact numbers (e.g., 10k, 20M) can be a bit of a mixed bag. Here's a tiny snippet:

from pprint import pprint

from spacy.matcher import Matcher
import en_core_web_sm
import spacy

print(f"Spacy version: {spacy.__version__}")
# Spacy version: 2.1.4

nlp = en_core_web_sm.load()
doc = nlp("Compact number formatting: 5k 5K 1m 1M")
pprint([(t.text, t.ent_type_) for t in doc])
# [('Compact', ''),
#  ('number', ''),
#  ('formatting', ''),
#  (':', ''),
#  ('5k', 'CARDINAL'),
#  ('5', 'CARDINAL'),
#  ('K', 'ORG'),
#  ('1', 'ORG'),
#  ('m', ''),
#  ('1', 'CARDINAL'),
#  ('M', '')]
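One workaround (a hypothetical preprocessing step, not part of spaCy) is to normalize compact numbers to plain digits before running the pipeline, so the tokenizer and NER only ever see ordinary cardinals:

```python
import re

# Multipliers for common compact-number suffixes (an illustrative, not exhaustive, list)
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def expand_compact_numbers(text):
    """Rewrite tokens like '5k' or '1M' as '5000' / '1000000'."""
    def repl(match):
        value = float(match.group(1))
        mult = SUFFIXES[match.group(2).lower()]
        return str(int(value * mult))
    return re.sub(r"\b(\d+(?:\.\d+)?)([kKmMbB])\b", repl, text)

print(expand_compact_numbers("Compact number formatting: 5k 5K 1m 1M"))
# Compact number formatting: 5000 5000 1000000 1000000
```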

dataframing avatar Jun 20 '19 07:06 dataframing

The small English model 2.1 assigns pos_ = VERB to "cat". The medium model works fine, and the small and medium 2.0 models work fine, too. I thought I'd report this despite it being a single-word inaccuracy, since the example is from https://course.spacy.io/, chapter1_03_rule-based-matching.md, section "Matching other token attributes".

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('loving', None, pattern)

# Process some text
doc = nlp("I loved dogs but now I love cats more.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

I -PRON- PRON nsubj loved
loved love VERB ROOT loved
dogs dog NOUN dobj loved
but but CCONJ cc loved
now now ADV advmod love
I -PRON- PRON nsubj love
love love VERB conj loved
cats cat VERB dobj love
more more ADV advmod love
. . PUNCT punct love

Gnuelp avatar Jul 19 '19 12:07 Gnuelp

Another issue that affects the small English model 2.1, but not the medium model, is related to https://github.com/explosion/spaCy/issues/3305.

import spacy

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
label = nlp.vocab.strings['GREETING']
print (label)

span_with_label = Span(doc, 0, 2, label=label)

# Add span to the doc.ents
doc.ents = [span_with_label]

12946562419758953770


ValueError Traceback (most recent call last)
     21 print (label)
     22
---> 23 span_with_label = Span(doc, 0, 2, label=label)
     24
     25 # Add span to the doc.ents

span.pyx in spacy.tokens.span.Span.cinit()

ValueError: [E084] Error assigning label ID 12946562419758953770 to span: not in StringStore.

Gnuelp avatar Jul 19 '19 12:07 Gnuelp


@Gnuelp Sorry for the late response here. That's a slightly tricky case where nlp.vocab.strings differentiates between just looking up a string vs. adding a new string to the StringStore. It will work with:

label = nlp.vocab.strings.add('GREETING')

As of spacy 2.1 this is simplified and you can just specify the label as a string in the Span:

span_with_label = Span(doc, 0, 2, label='GREETING')

adrianeboyd avatar Oct 02 '19 12:10 adrianeboyd

When I updated to 2.2, I noticed that the pre-trained POS tagger started automatically tagging all sentence-initial nouns as PROPN instead of NOUN as it did before. This is a good heuristic for some datasets, but it threw off my pipeline (my data contains mostly bare noun labels like "man with a newspaper" or "spy", where this behavior was unexpected.)

hawkrobe avatar Oct 24 '19 05:10 hawkrobe

@hawkrobe Interesting observation! This is an unexpected side effect of trying to make the models less sensitive to capitalization overall, with the intention of improving performance on data that doesn't look like formally edited newspaper text. The main difference from the 2.1 models is that some training sentences are randomly lowercased. Since your data is not really the kind of data spacy's models are intended for, I don't think it makes sense to try to optimize the general-purpose models for this case (although it's still very useful to be made aware of these kinds of changes in behavior!).

One possible workaround is to use a framing sentence that you can insert your phrases into that looks more like newspaper text. Something like:

"The president saw the [bare NP] yesterday."

Then you are much more likely to get the correct analysis and you can extract the annotation that you need.
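The framing trick can be implemented as a small helper that inserts the phrase and remembers its character offsets, so the relevant tokens can be recovered afterwards (a sketch; the frame sentence and function name are made up for illustration):

```python
FRAME_PREFIX = "The president saw the "
FRAME_SUFFIX = " yesterday."

def frame_phrase(phrase):
    """Embed a bare NP in a newspaper-like sentence; return the text and
    the (start, end) character offsets of the phrase inside it."""
    start = len(FRAME_PREFIX)
    end = start + len(phrase)
    return FRAME_PREFIX + phrase + FRAME_SUFFIX, (start, end)

text, (start, end) = frame_phrase("man with a newspaper")
print(text)             # The president saw the man with a newspaper yesterday.
print(text[start:end])  # man with a newspaper
```

With a loaded model, the offsets could then be passed to doc.char_span(start, end) to pull out exactly the tokens of the original phrase.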

adrianeboyd avatar Oct 24 '19 12:10 adrianeboyd

@adrianeboyd aha, I appreciate the info about the changes that led to this side effect, and I love the idea of using the framing sentence to bring the bare NPs closer to in-sample text.

hawkrobe avatar Oct 28 '19 19:10 hawkrobe

Hi,

I have noticed some inconsistencies with the lemmatizer and stop words for the Italian model. I don't know if this is the best place to report it, but please forgive me: I am not very expert in ML, DL, models and so on, and I am learning. In particular, I am not sure whether lemmatization is performed by the pre-trained model or by something else. Currently I am following the tutorial "Classify text using spaCy" to test and understand what spaCy is capable of.

I can summarize my issue with some code.

import spacy
from spacy.lang.it.stop_words import STOP_WORDS

nlp = spacy.load('it_core_news_sm')
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd, before="parser")

stop_words = STOP_WORDS

doc = nlp('Ciao marco, ieri ero sul lavoro per questo non ti risposi!')

# Create list of word tokens
token_list = []
for token in doc:
	token_list.append(token.text)
print(token_list)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
	sents_list.append(sent.text)
print(sents_list)

filtered_words = []
for word in doc:
	if word.is_stop is False:
		filtered_words.append(word)
print('filtered_words:', filtered_words)

# finding lemma for each word
for word in doc:
	print(word.text, word.lemma_)

This is the result.

['Ciao', 'marco', ',', 'ieri', 'ero', 'sul', 'lavoro', 'per', 'questo', 'non', 'ti', 'risposi', '!']
['Ciao marco, ieri ero sul lavoro per questo non ti risposi!']
filtered_words: [Ciao, marco, ,, risposi, !]
Ciao Ciao
marco marcare
, ,
ieri ieri
ero essere
sul sul
lavoro lavorare
per per
questo questo
non non
ti ti
risposi risposare
! !

First of all a translation of the text. The text means: "Hello marco, yesterday I was at work for this I didn't answer you!".

Then I can say that tokenization is correct.

Subdivision into sentences is not very important to me, but I think it is correct, even if I am not sure whether "sentence" is synonymous with "period": if yes, it's correct; if not, and the sentencizer is supposed to also split the parts of a "period", it is wrong.

The real problems come with stop words and the lemmatizer. I am not sure why the stop words include "lavoro" ("work") and, arguably, "ieri" ("yesterday"). Can any topic detector extract valuable meaning from only [Ciao, marco, ,, risposi, !]? That is, "Hello", the name "Marco" and the verb "to answer"...

Finally, lemmatization gives probably the worst results.

"Marco" is a name; "marcare" is a verb and means "to mark". And "risposi" means "I answered" and doesn't lemmatize to "risposare" but to "rispondere"; "risposare" means "to marry again".

If I can fix such errors myself somehow, let me know; I would really like to do it if I can. Otherwise, I hope this can be of some help.

Any explanation or help is welcome, I am quite ignorant about this kind of software at the moment.

Thank you

endersaka avatar Nov 07 '19 04:11 endersaka

@endersaka: This is the correct place, thanks!

For languages where we don't have a better rule-based lemmatizer, spacy uses lemma lookup tables, which can be better than not having a lemmatizer, but they aren't great. There are some errors in the tables and the results are often not so great because it can't provide multiple lemmas for words like "risposi" or disambiguate based on POS tags. It looks like "risposi" is ambiguous (if you don't have any other context), so the simple lookup table is never going to handle this word correctly in all cases. It looks like "Marco" returns "Marco" and "marco" returns "marcare".

The one advantage of simple lookup tables is that they are easy to understand and modify. The table is here: https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/it_lemma_lookup.json
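A lookup lemmatizer is essentially just a dictionary lookup with the surface form as fallback, which is why a single-entry table can never disambiguate a form like "risposi". A simplified sketch (with a tiny invented table, not the actual spaCy implementation):

```python
# Tiny illustrative fragment of a lemma lookup table (entries invented for the example)
LEMMA_LOOKUP = {
    "ero": "essere",
    "risposi": "risposare",  # only one lemma can be stored, even though
                             # "rispondere" is also a valid analysis
}

def lookup_lemma(word):
    """Return the table lemma, falling back to the word itself."""
    return LEMMA_LOOKUP.get(word, word)

print(lookup_lemma("ero"))      # essere
print(lookup_lemma("risposi"))  # risposare (wrong in the "answered" reading)
print(lookup_lemma("ciao"))     # ciao (no entry: fall back to the form)
```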

The stop words, which look like they've been unmodified since 2016, are here: https://github.com/explosion/spaCy/blob/master/spacy/lang/it/stop_words.py

Stop words are often pretty task-specific, so it's hard to provide a perfect list for everyone. Here's how to modify the stop words: https://stackoverflow.com/a/51627002/461847
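Since stop lists are task-specific, one option is to start from the shipped list and adjust it before filtering. A sketch using a plain set as a stand-in (with spaCy installed you would instead start from `spacy.lang.it.stop_words.STOP_WORDS` and update `is_stop` on the vocab, as described in the linked answer):

```python
# Stand-in for spacy.lang.it.stop_words.STOP_WORDS (a tiny invented subset)
STOP_WORDS = {"per", "questo", "non", "ti", "lavoro", "ieri"}

# Task-specific adjustments: keep content words like "lavoro"/"ieri", drop greetings
custom_stops = (STOP_WORDS - {"lavoro", "ieri"}) | {"ciao"}

tokens = ["ciao", "marco", "ieri", "ero", "sul", "lavoro",
          "per", "questo", "non", "ti", "risposi"]
filtered = [t for t in tokens if t not in custom_stops]
print(filtered)  # ['marco', 'ieri', 'ero', 'sul', 'lavoro', 'risposi']
```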

We're always happy to get pull requests or other updates from people with more expertise in a particular language! If you potentially have a better source for Italian stop words, let us know...

adrianeboyd avatar Nov 07 '19 08:11 adrianeboyd

I've opened an issue before I saw that this thread existed. You can find it here: #4611

graftim avatar Nov 08 '19 09:11 graftim

@adrianeboyd sincerely thanks for the explanation. Currently I am taking a look at the documentation on spaCy Web site and more precisely to the Adding Languages page to understand the architecture.

About "risposi", yes! I confirm. It's a very special case. Actually the verb token of the character sequence "non ti risposi" can have two different meanings in two different tenses, depending on the context. In my example case the verb means "to answer" coniugated in simple past tense (that tense, in Italian, translates to "passato remoto", literally "remote past"); "ti", of course, is the pronoun "you" the object of the sentence, receiving the action. "non" is negation. In the case produced by the lemmatizer spaCy takes in to account the other meaning in which the verb is reflective with the help of the pronoun "ti" (also called pronominal verb). In English such reflective form (literally translated) would mean (to be as precise as possible) "you didn't marry yourself once again" :-D

Apart this, funny and interesting matter, I will read carefully the spaCy documentation and try to understand how the Lemmatizer works to eventually propose some modification in the future.

endersaka avatar Nov 08 '19 16:11 endersaka

@endersaka spaCy models are not caseless. The same happens with the English model: basically, it won't analyze a phrase with incorrect casing well. Either train another model on a caseless dataset, or try a truecaser if your dataset has incorrect casing.

alessio-greco avatar Nov 09 '19 16:11 alessio-greco

@killia15 The problem is that there are not any occurrences of the word "tu" in the training data that we're using for French, which comes from an older release of this corpus: https://github.com/UniversalDependencies/UD_French-Sequoia/ . The sources are (according to their README): Europarl, Est Republicain newspaper, French Wikipedia and European Medicine Agency.

This is a relatively common problem in corpora that are based on formally edited texts like newspapers or encyclopedia-style texts like wikipedia. I see "vous" but not "tu". This is clearly not great!

Hopefully in the future we can train models with more/better data that won't have this problem. The newest version of the UD GSD corpora, which was released last week, have dropped the non-commercial restriction, so that can potentially provide more data for French (and a few other languages). If you're interested in training a model for French with UD_French-GSD to use now, I can provide a sketch of how to convert and train, which is pretty easy with spacy's CLI commands.

adrianeboyd avatar Nov 20 '19 07:11 adrianeboyd

@adrianeboyd That would certainly explain it! Though great timing with the UD GSD corpora. I would be very interested in training the model. Let me know how I can help. My goal for spacy is to use it for a project where we’re automatically analyzing French texts (news articles, blog posts, poems, passages from books etc) to predict which vocabulary and grammar structures a student will and won’t know so we can make recommendations to their instructor on what they should be working on. Our goal is to publish a paper on it so we’re certainly invested in the success of Spacy’s French model!

killia15 avatar Nov 20 '19 13:11 killia15

@killia15:

The current spacy release doesn't handle subtokens in a particularly good way (you only see the subtoken strings like de les rather than des), but you can convert the data and train a model like this:

spacy convert -n 10 -m train.conllu .
spacy convert -n 10 -m dev.conllu .
spacy train fr output_dir train.json dev.json -p tagger,parser

After converting the data, you'll have tags that look like this:

NOUN__Gender=Masc|Number=Sing

After converting and before training, make sure the current lang/fr/tag_map.py has the tags you need. The current tag map just maps to the UD tags like this, so if you don't need the morphological features, you'll just need to check that none are missing (you may have some new combinations of morphological features):

"NOUN__Gender=Masc|Number=Sing": {POS: NOUN},

If you'd like better access to the morphological features (not just as a clunky token.tag_ string), you can expand the mapping to include the features:

"NOUN__Gender=Masc|Number=Sing": {POS: NOUN, 'Gender': 'Masc', 'Number': 'Sing'},

Spacy supports the UD morphological features, so you should be able to do this automatically from a list of the tags in the converted training data. (In the future there should be a statistical morphological tagger, but for now the morphological features are just mapped from the part-of-speech tags.)
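Generating the extended tag-map entries automatically from the tags in the converted training data can be sketched like this (`build_tag_map_entry` is a helper written for illustration, not a spaCy function):

```python
def build_tag_map_entry(tag):
    """Split a combined tag like 'NOUN__Gender=Masc|Number=Sing' into
    a tag-map entry with the coarse POS plus morphological features."""
    pos, _, feats = tag.partition("__")
    entry = {"POS": pos}
    if feats:
        for feat in feats.split("|"):
            name, _, value = feat.partition("=")
            entry[name] = value
    return entry

print(build_tag_map_entry("NOUN__Gender=Masc|Number=Sing"))
# {'POS': 'NOUN', 'Gender': 'Masc', 'Number': 'Sing'}
print(build_tag_map_entry("ADV"))
# {'POS': 'ADV'}
```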

adrianeboyd avatar Nov 22 '19 09:11 adrianeboyd

As reported here, I encounter some very suspicious parses when using the GPU. Using the CPU version works as expected. Below you find some examples, first the source sentence followed by its tokenisation (which looks alright), then the POS (which look off), and finally DEP (which seems random/uninitialized):

s = "The decrease in 2008 primarily relates to the decrease in cash and cash equivalents 1.\n"
['The', 'decrease', 'in', '2008', 'primarily', 'relates', 'to', 'the', 'decrease', 'in', 'cash', 'and', 'cash', 'equivalents', '1', '.', '\n']
['VERB', 'PRON', 'PROPN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NUM', 'PRON', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'VERB', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']

s = "The Company's current liabilities of €32.6 million primarily relate to deferred income from collaborative arrangements and trade payables.\n"
['The Company', "'s", 'current', 'liabilities', 'of', '&', 'euro;32.6', 'million', 'primarily', 'relate', 'to', 'deferred', 'income', 'from', 'collaborative', 'arrangements', 'and', 'trade', 'payables', '.', '\n']
['NOUN', 'VERB', 'AUX', 'NOUN', 'NOUN', 'PROPN', 'PROPN', 'PROPN', 'VERB', 'VERB', 'ADV', 'VERB', 'VERB', 'NOUN', 'NOUN', 'PROPN', 'NOUN', 'PROPN', 'VERB', 'NUM', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'ROOT', '']

s = 'The increase in deferred income is related to new deals with partners.\n'
['The', 'increase', 'in', 'deferred', 'income', 'is', 'related', 'to', 'new', 'deals', 'with', 'partners', '.', '\n']
['NOUN', 'PROPN', 'PROPN', 'VERB', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'ADV', 'VERB', 'NOUN', 'VERB', 'NOUN', 'SPACE']
['dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'dep', 'punct', 'dep', 'dep', 'ROOT', '']

Example repo with data here. Note that the issue does not seem to occur on Linux but only on Windows and only when using the GPU.

BramVanroy avatar Dec 03 '19 08:12 BramVanroy

@BramVanroy I think this is probably a cupy issue. Their disclaimer:

We recommend the following Linux distributions.

Ubuntu 16.04 / 18.04 LTS (64-bit) CentOS 7 (64-bit)

We are automatically testing CuPy on all the recommended environments above. We cannot guarantee that CuPy works on other environments including Windows and macOS, even if CuPy may seem to be running correctly.

adrianeboyd avatar Dec 03 '19 09:12 adrianeboyd

Figured so. Perhaps a (HUGE) disclaimer in the docs would be welcome, then, discouraging people from using the GPU on Windows. If you agree, I can create a pull request.

BramVanroy avatar Dec 03 '19 09:12 BramVanroy

I think a warning in the docs would be good. This hasn't come up before, so either it's rare for people to be using a GPU in windows or something changed in thinc/cupy to cause this. Can you see if it does work with slightly older versions of thinc and/or cupy? (We have no way to test this ourselves.)

adrianeboyd avatar Dec 03 '19 10:12 adrianeboyd

I think that indeed not a lot of people are using Windows (do you have any stats on this from PyPI, perhaps? Would be interesting to see!), but I also don't think it is very well known that GPU support is available, simply because the CPU performance is so incredibly good. In my projects I use the CPU version and parallelize it, and I never felt like I missed performance, so I never went looking for GPU support. I only recently bumped into it, tried it out on my home PC, and found out it didn't work.

When I find the time, I can dig into this deeper and try older versions. I think I tried down to v2.0 (which also didn't work) but I'll have to check. (It might be useful to re-open the linked topic so I can keep it updated rather than flooding this topic.)

BramVanroy avatar Dec 03 '19 10:12 BramVanroy

I think a new issue focused on windows + GPU would be useful. I didn't mean older versions of spacy, just older versions of cupy and maybe thinc (within the compatibility ranges, of course).

adrianeboyd avatar Dec 03 '19 10:12 adrianeboyd

Even though I would definitely like to see full blown GPU support for Windows, I'm not sure whether this is something that spaCy can fix if the problem lies in cupy? But if requested I can make a new issue, sure.

BramVanroy avatar Dec 03 '19 10:12 BramVanroy

We can't necessarily fix it, but if we (well, you) can figure out that a particular version of cupy works better, we can provide that information in the warning. A new issue could also help people with the same problem when they search in the tracker, since you're right that it's getting pretty off-topic here. (Maybe we can just move all these comments to a new issue?)

adrianeboyd avatar Dec 03 '19 13:12 adrianeboyd

I am having problems validating the accuracy of nl_core_news_sm when trying to run it on the Lassy Small test dataset.

I see the model is trained on the same dataset, so I am assuming it's just the training data, but the tagging accuracy mentioned in the GitHub release is 90.97%, while when I run the same model on the same dataset I get an accuracy of 75.12%.

Github: https://github.com/umarniz/spacy-validate-nl-model

Can you confirm if this is the same dataset the Spacy model is tested on?

Secondly, I was unable to find the code used to calculate the accuracy numbers attached to the GitHub release. Is there a place where that code is available?

I can imagine that sharing the datasets they are run on might not be legal, but the code could be useful for people who obtain the dataset themselves and want to validate :)

umarniz avatar Dec 09 '19 20:12 umarniz

@umarniz That looks like a mistake in the docs, sorry for the confusion! The NER model is trained on NER annotation done on top of UD_Dutch-LassySmall, but the tagger and parser are trained on UD_Dutch v2.0 (since renamed to UD_Dutch-Alpino).

spaCy calculates the accuracy using its internal scorer. You can run an evaluation on a corpus in spaCy's training format using the command-line spacy evaluate model corpus.json, which runs nlp.evaluate() underneath.

You can convert a CoNLL-U corpus to spacy's training format with combined tag+morph tags using the command: spacy convert -n 10 -m file.conllu .

Be aware that the reported tagger evaluation is for token.tag, not token.pos, and is for the UD dev subcorpus rather than the test subcorpus. (It's somewhat confusingly labelled POS on the website, but see the note in the mouseover.)
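
For reference, the score itself is straightforward: a tagging accuracy like 90.97% is just the percentage of tokens whose predicted fine-grained tag matches the gold tag. A minimal sketch (an illustration, not spaCy's internal `Scorer` code):

```python
# Hand-rolled tagging accuracy: fraction of tokens where the predicted
# fine-grained tag equals the gold tag, as a percentage.
def tag_accuracy(gold_tags, pred_tags):
    assert len(gold_tags) == len(pred_tags)
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return 100.0 * correct / len(gold_tags)

print(tag_accuracy(["NN", "VBD", "DT"], ["NN", "VBD", "NN"]))  # 66.666...
```

If your own evaluation script computes this over token.pos_ instead of token.tag_, or over the test split instead of dev, the number will not match the released figure.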

adrianeboyd avatar Dec 10 '19 08:12 adrianeboyd

More examples of hyphenated words being tagged incorrectly (like in #4974): pre-primary, co-pay, ex-wife, ex-mayor, de-ice

ewaldatsensentia avatar Feb 06 '20 18:02 ewaldatsensentia

Verb "to be" is being marked as AUX instead of VERB when it is actually the main verb.

If I use displacy on this sentence "I have been in Wuhan" why do I see the POS "AUX" on been? Isn't it a verb? https://explosion.ai/demos/displacy?text=I%20have%20been%20in%20Wuhan&model=en_core_web_lg&cpu=1&cph=1

It happens on all the pretrained English models

MartinoMensio avatar Feb 10 '20 16:02 MartinoMensio

I hope this is the right place to report some confusing behavior where spaCy 2.2.3 and en_core_web_md 2.2.5 on Python 3.7 seem to produce a different lemma and part-of-speech tag when a noun is capitalized at the beginning of a sentence. I've minimized an example with the word "time", but I have seen what appears to be the same issue with the words "psychoanalysis" and "interpretation", at least. This program:

import en_core_web_md

srcs = ["An historian employs most of these words at one time or another.",
        "Our first task is to understand our own times.",
        "Time is therefore that mediating order.", # problem!
        "Times are changing."]

nlp = en_core_web_md.load()

rslts = [[{"lemma": t.lemma_, "tag": t.tag_}
            for t in doc if "time" == t.norm_[0:4]][0]
         for doc in nlp.pipe(srcs)]

if __name__ == "__main__":
    import sys
    import json
    json.dump(rslts, sys.stdout, indent=1, sort_keys=True)
    sys.stdout.write("\n")

produces the following output:

[
 {
  "lemma": "time",
  "tag": "NN"
 },
 {
  "lemma": "time",
  "tag": "NNS"
 },
 {
  "lemma": "Time",
  "tag": "NNP"
 },
 {
  "lemma": "time",
  "tag": "NNS"
 }
]

I expected the use of "time" in the third sentence ("Time is therefore that mediating order.") to be lemmatized as "time" and tagged as "NN", consistent with the other examples.

LiberalArtist avatar Feb 11 '20 21:02 LiberalArtist

@LiberalArtist : The v2.2 models use some new data augmentation to try to make them less case-sensitive, which leads to less certainty about NN vs. NNP distinctions, and for Time in particular, the training data includes lots of cases of Time as NNP from the magazine Time or Time Warner.
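
To illustrate what that kind of case augmentation looks like, here is a toy sketch (the function name and sampling scheme are illustrative, not spaCy's actual augmentation code): some training texts are duplicated in lowercase so the model relies less on capitalization when deciding NN vs. NNP.

```python
import random

# Toy case augmentation: append a lowercased copy of a subset of the
# training texts, chosen at a fixed rate with a seeded RNG.
def augment_case(texts, rate=0.3, seed=0):
    rng = random.Random(seed)
    augmented = list(texts)
    for text in texts:
        if rng.random() < rate:
            augmented.append(text.lower())
    return augmented

print(augment_case(["Time is money.", "Our own times."], rate=1.0))
# ['Time is money.', 'Our own times.', 'time is money.', 'our own times.']
```

A side effect is exactly what's reported above: the model sees both "Time" and "time" in similar contexts, so capitalization alone is a weaker NNP signal.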

adrianeboyd avatar Feb 12 '20 10:02 adrianeboyd

@MartinoMensio: The POS tags are mapped from the fine-grained PTB tag set, which doesn't distinguish auxiliary verbs from main verbs. All verbs get mapped to VERB except for some exceptions below, where everything that might be an AUX gets mapped to AUX:

https://github.com/explosion/spaCy/blob/99a543367dc35b12aad00c4cd845ddd1f4870056/spacy/lang/en/morph_rules.py#L388-L487

This mapping is kind of crude, and we're working on statistical models for morphological tagging and POS tags to replace this.
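
A toy sketch of that kind of mapping (the real exceptions are word-form-based and live in the morph_rules file linked above; the tag and lemma lists here are illustrative, not exhaustive). It shows why "been" ends up as AUX even when it is the main verb:

```python
# Anything tagged as a PTB verb whose lemma *might* be an auxiliary gets
# mapped to AUX; all other verbs map to VERB.
AUX_LEMMAS = {"be", "have", "do", "will", "would", "can", "could",
              "may", "might", "shall", "should", "must"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}

def coarse_pos(tag, lemma):
    if tag in VERB_TAGS:
        return "AUX" if lemma.lower() in AUX_LEMMAS else "VERB"
    return "OTHER"  # non-verb tags are mapped by the regular tag map

print(coarse_pos("VBN", "be"))   # AUX  -> "been" in "I have been in Wuhan"
print(coarse_pos("VBD", "eat"))  # VERB
```

Because the mapping can't see the syntactic role, a copular or existential "be" acting as the main verb is indistinguishable from a true auxiliary.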

adrianeboyd avatar Feb 12 '20 10:02 adrianeboyd

I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.

lingdoc avatar Mar 06 '20 02:03 lingdoc

I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.

You are right that this is probably not the place to discuss this. You may be interested in the wrappers spacy-stanfordnlp, spacy-udpipe, spacy-transformers, and probably more. Benepar has a plugin for spaCy, too. SpaCy has many benefits over SOTA models imo. It's not just a model: it's a whole framework with customization options, the ability to train your own model, and so on. Next to that, the benefit for me is spaCy's speed, which can easily be scaled on CPU using multiprocessing, and the fact that it is open-source and actively developed. Many if not most models or implementations that achieve (near) SOTA have been developed, benchmarked, and forgotten. That means you can't get any help: no community, no bug fixes. Active development is my key concern with all of these SOTA models and frameworks.

BramVanroy avatar Mar 06 '20 12:03 BramVanroy

Very cool! I completely agree that active development is a key concern, along with speed, right up there with accuracy of POS tags. I simply wasn't aware of the spacy-stanfordnlp wrapper or the benepar plugin, will definitely give them a look.

lingdoc avatar Mar 06 '20 13:03 lingdoc

Hi all - have been noticing lately the sentence boundary detection with the default parser on both small and medium models seems to be a bit more off than what I remembered in previous versions:

For example:

tmp_txt = """
He said: “Clearly there is a global demand for personal protective equipment at the moment and I know that government with our support is working night and day to ensure that we procure the PPE that we need.”

Turkey has sent 250,000 items of personal protective equipment to the UK which will now be distributed to medical centres around the country, according to the Ministry of Defence.

A delivery of 50,000 N-95 face masks, 100,000 surgical masks, and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday. Turkey has previously donated supplies to fellow Nato members Spain and Italy.

Ben Wallace, UK defence secretary, said the “vital equipment” from Ankara would bring protection and relief to thousands of critical workers across the UK.
"""

tmp_doc = nlp(tmp_txt)

print([sent for sent in tmp_doc.sents])

Messes up the SBD in the first sentence. Happens on both small and medium models (en_core_web_sm and en_core_web_md).

aced125 avatar Apr 12 '20 17:04 aced125

Interesting. It looks like the dependency parser doesn't handle conjoined clauses terribly well with a following 1st person pronoun. This is clearer with the raw text output:

print([x.text for x in tmp_doc.sents])

['\nHe said: “Clearly there is a global demand for personal protective equipment
 at the moment',  'and I know that government with our support is working night
 and day to ensure that we procure the PPE that we need.”\n\n', 'Turkey has sent
 250,000 items of personal protective equipment to the UK which will now be
 distributed to medical centres around the country, according to the Ministry of
 Defence.\n\n', 'A delivery of 50,000 N-95 face masks, 100,000 surgical masks,
 and 100,000 protective suits arrived at RAF Brize Norton in Oxfordshire on Friday.',
 'Turkey has previously donated supplies to fellow Nato members Spain and
 Italy.\n\n', 'Ben Wallace, UK defence secretary, said the “vital equipment” from
 Ankara would bring protection and relief to thousands of critical workers across
 the UK.\n']

In this case the mis-SBD seems to be caused by the second clause in the conjoined sentence starting with the pronoun I, which apparently is interpreted by the parser as the subject of a separate sentence rather than the subject of a conjoined clause. This is clearer from the following toy example:

tmp_txt = """The man had a dog who liked to run and he liked to chase the cat.
The man had a dog who liked to run and I liked to chase him.
The man had a dog who liked to run and I liked to chase the cat."""

tmp_doc = nlp(tmp_txt)

print([x.text for x in tmp_doc.sents])

gives:

['The man had a dog who liked to run and he liked to chase the cat.\n',
 'The man had a dog who liked to run', 'and I liked to chase him.\n',
 'The man had a dog who liked to run', 'and I liked to chase the cat.']

where the second and third sentences get split because of the same pattern, even though sentence 2 has an anaphoric pronoun that refers to an element of the previous clause.
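
Until this parses better, the trigger pattern is cheap to screen for, e.g. to flag sentences whose sentence-boundary output deserves a manual check. A heuristic sketch, not a fix (it will also flag plenty of correctly-parsed sentences):

```python
import re

# Flag the pattern identified above: a coordinating "and"/"but" followed
# by a 1st-person pronoun, which the parser sometimes treats as the start
# of a new sentence.
SPLIT_RISK = re.compile(r"\b(?:and|but)\s+I\b")

sents = [
    "The man had a dog who liked to run and he liked to chase the cat.",
    "The man had a dog who liked to run and I liked to chase him.",
]
print([bool(SPLIT_RISK.search(s)) for s in sents])  # [False, True]
```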

lingdoc avatar Apr 13 '20 01:04 lingdoc

I have noticed a few weird (incorrect) changes after upgrading from 2.0.18 to 2.2.4. Should I report those here? For example, the sentence "make me a sandwich". I guess this is explainable by assuming it confuses cake as a noun vs. cake as a verb.

v2.0.18: (screenshot)

v2.2.4: (screenshot)

fersarr avatar Apr 23 '20 17:04 fersarr

@fersarr: It's useful to have these kinds of results here, thanks! Imperatives are a case where the provided models often perform terribly because there are very few imperatives in the training data. If you know you have an imperative sentence, it's hacky, but adding a subject like we or you at the beginning of a sentence can improve the analysis a lot. (See some discussion in #4744.)

It would be nice to extend our training data in areas where we know there are problems because most of the models are trained on more formal text like newspaper text, but we don't have any concrete plans in this area yet. (Some common problems are: questions, imperatives, 1st and 2nd person (informal) pronouns, female pronouns, etc.)

adrianeboyd avatar Apr 23 '20 19:04 adrianeboyd

@fersarr: It's useful to have these kinds of results here, thanks! Imperatives are a case where the provided models often perform terribly because there are very few imperatives in the training data. If you know you have an imperative sentence, it's hacky, but adding a subject like we or you at the beginning of a sentence can improve the analysis a lot. (See some discussion in #4744.)

It would be nice to extend our training data in areas where we know there are problems because most of the models are trained on more formal text like newspaper text, but we don't have any concrete plans in this area yet. (Some common problems are: questions, imperatives, 1st and 2nd person (informal) pronouns, female pronouns, etc.)

Thanks @adrianeboyd for the link to #4744 and the interesting idea to add we or you to the imperative. Unfortunately, it didn't change the outcome in this case 😞. I will think of alternatives

fersarr avatar Apr 24 '20 09:04 fersarr

just FYI, another regression below. However, this one seems to only have happened after 2.2.0 because the spacy visualizer demo (2.2.0) shows it correctly

v2.0.18: (screenshot)

v2.2.4: (screenshot)

fersarr avatar Apr 24 '20 16:04 fersarr

Noticed an example in which the small model fails but the medium model succeeds. murmured is incorrectly tagged as a proper noun starting the sentence in the example below (perhaps explaining the lack of lemmatization). When a noun phrase precedes it, it's correctly parsed.

Small Model:

import spacy

nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")

sent = "murmured Nick in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)

murmured murmured PROPN NNP ROOT
Nick Nick PROPN NNP dobj
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj

sent = "I murmured in the library"
for token in nlp_sm(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)

I -PRON- PRON PRP nsubj
murmured murmur VERB VBD ROOT
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj

Medium Model

sent = "murmured Nick in the library"
for token in nlp_md(sent):
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)

murmured murmur VERB VBD ROOT
Nick Nick PROPN NNP npadvmod
in in ADP IN prep
the the DET DT det
library library NOUN NN pobj

spacy                     2.2.3            py37ha1b3eb9_0    conda-forge

beckernick avatar May 19 '20 22:05 beckernick

Hello there,

I am wondering if there is a way to force the POS tagger to treat tokens as non-verbs in order to not mess up the dependency parser. In my case, I have as input a long list of noun chunks, hence no verbs are expected to occur in my input. I noticed that for some cases the POS tagger gets confused:

import spacy

nlp = spacy.load('en_core_web_lg')
chunks = ['reading light', 'flashing light']

for chunk in chunks:
    doc = nlp(chunk)
    for token in doc:
        print(token.text, token.dep_, token.tag_)
    print('-'*10) 

yields:

reading ROOT VBG
light dobj NN
----------
flashing ROOT VBG
light dobj NN

while the expected output would be that in both chunks the ROOT is "light". So, can I hint the tagger that I am giving it something that can't be verb-ish? That way the parser would not fail, I presume.

Thanks!

stelmath avatar Jun 04 '20 11:06 stelmath

I'm not sure if this is the place to ask, but I'm wondering, given the state of the art for POS tagging as reported by: https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) - is there a particular reason that spaCy uses its own trained models rather than wrappers for existing models that report better accuracy numbers? I understand their reported accuracy might be on a completely different set of benchmarks, but have they been evaluated on spaCy's benchmarks? Are there licensing restrictions that keep them from being integrated as POS taggers? A parallel might be the vast number of pretrained language/embedding models implemented by HuggingFace's or TensorFlow Hub's repos, many of which are developed by people not directly associated with the repos themselves.

My understanding is that spaCy defaults to trading some accuracy for speed. Using the defaults means you get that compromise too. But spaCy is very hackable, so you can BYOM. They also link to (and I think, made) spaCy Stanza which is one way to use fancier models that are slower but more accurate.

sam-writer avatar Jul 20 '20 22:07 sam-writer

None of the three German language models recognizes Karlsruhe (a city) as LOC, though they do recognize smaller cities.

import spacy
nlp_de = spacy.load('de_core_news_lg')
doc_de = nlp_de('Ettlingen liegt bei Karlsruhe.')
for entity in doc_de.ents:
    print(entity.text, entity.label_)

Result:

Ettlingen LOC
Karlsruhe MISC

andreas-wolf avatar Aug 05 '20 13:08 andreas-wolf

I noticed that the NER makes some mistakes when tagging text with money depending on the numerical value and I wonder if you could do something about it when training the next version of the models.

For instance, if I run the following notebook cell

import spacy
nlp = spacy.load('en_core_web_lg')


symbols = ["$", "£", "€", "¥"]
for symbol in symbols:
    print("-----------------------")
    print("Symbol: {}".format(symbol))
    print("-----------------------")
    for j in range (0, 2):
        for i in range (1, 10):
            text = nlp(symbol + str(j) + '.' + str(i) + 'mn favourable variance in employee expense over the forecast period')
            print (str(j) + '.' + str(i))
            for ent in text.ents:
                print(ent.text, ent.start_char, ent.end_char, ent.label_)

I get the following results, where

  • for $, spaCy gets most of them correct (but notice that nothing is output for 1.1 and 1.5),
  • for £, it misses a few and also mistags one as ORG,
  • for €, it misses quite a few more, and
  • for ¥, in addition to missing several, it doesn't tag any of them as MONEY...
-----------------------
Symbol: $
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 MONEY
0.5
0.5mn 1 6 MONEY
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.8mn 1 6 MONEY
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: £
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.2mn 1 6 MONEY
0.3
0.3mn 1 6 MONEY
0.4
0.4mn 1 6 ORG
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
0.9mn 1 6 MONEY
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.3mn 1 6 MONEY
1.4
1.4mn 1 6 MONEY
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.8mn 1 6 MONEY
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: €
-----------------------
0.1
0.1mn 1 6 MONEY
0.2
0.3
0.3mn 1 6 MONEY
0.4
0.5
0.6
0.6mn 1 6 MONEY
0.7
0.7mn 1 6 MONEY
0.8
0.9
1.1
1.2
1.2mn 1 6 MONEY
1.3
1.4
1.5
1.6
1.6mn 1 6 MONEY
1.7
1.7mn 1 6 MONEY
1.8
1.9
1.9mn 1 6 MONEY
-----------------------
Symbol: ¥
-----------------------
0.1
0.2
0.3
0.4
¥0.4mn 0 6 ORG
0.5
¥0.5mn 0 6 ORG
0.6
¥0.6mn 0 6 GPE
0.7
0.8
¥0.8mn 0 6 ORG
0.9
¥0.9mn 0 6 ORG
1.1
¥1.1mn 0 6 ORG
1.2
1.3
1.4
1.5
1.6
¥1.6mn 0 6 ORG
1.7
1.8
¥1.8mn 0 6 NORP
1.9
¥1.9mn 0 6 CARDINAL

Info about model 'en_core_web_lg'

(...) version 2.3.1
spacy_version >=2.3.0,<2.4.0
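
Tallied as a table, the drop-off per currency symbol is stark. The counts below were read off the output above by hand (18 amounts per symbol), so recount before relying on them:

```python
# MONEY hits per currency symbol, out of 18 test amounts each
# (hand-tallied from the en_core_web_lg output above).
detected_as_money = {"$": 16, "£": 13, "€": 8, "¥": 0}
for symbol, hits in detected_as_money.items():
    print(f"{symbol}: {hits}/18 tagged as MONEY ({100 * hits / 18:.0f}%)")
```

The pattern is consistent with the training data containing far more dollar amounts than euro or yen amounts.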

barataplastica avatar Oct 05 '20 14:10 barataplastica

Facing an interesting issue with the large (and small) pre-trained models

import spacy
nlp = spacy.load('en_core_web_lg')

text = "POL /hi there abcde ffff"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# POL 0 3 PERSON

If I remove the "/" it does not detect any entities (the expected behavior). Any idea why the leading slash throws NER off?

This is observed using spacy 2.2.3

Btw I also tried with the spacy nightly and TRF model, the issue does not exist with the transformer model

abh1nay avatar Nov 13 '20 01:11 abh1nay

spaCy 3.0 tags "tummy" as a determiner in "tummy ache". I think this is serious, since determiners are closed-class words and arguably there should be literally half a dozen of them in the English language and never more. Tagging any content word as a determiner is likely to cause many apps to ignore it.

Model: en_core_web_lg-3.0.0

In [1]: import spacy; nlp = spacy.load('en_core_web_lg')

In [2]: [tok.tag_ for tok in nlp('tummy ache')]
Out[2]: ['DT', 'NN']

adam-ra avatar Feb 11 '21 08:02 adam-ra

Here's a perverse case. Is Will a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.

nlp = spacy.load('en_core_web_sm')
doc = nlp("Will Will Shakespeare write his will?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Will will AUX MD aux Xxxx True True
Will will AUX MD aux Xxxx True True
Shakespeare shakespeare VERB VB nsubj Xxxxx True False
write write VERB VB ROOT xxxx True False
his his PRON PRP$ poss xxx True True
will will NOUN NN dobj xxxx True True
? ? PUNCT . punct ? False False

gitclem avatar Feb 12 '21 02:02 gitclem

Here's another perverse case. Is May a first name or an AUX MD? spaCy 3.0.1 with en_core_web_sm gets confused.

nlp = spacy.load('en_core_web_sm')
doc = nlp("May May celebrate May Day with us?")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

May May PROPN NNP ROOT Xxx True True
May may AUX MD aux Xxx True True
celebrate celebrate VERB VB ROOT xxxx True False
May May PROPN NNP compound Xxx True True
Day Day PROPN NNP npadvmod Xxx True False
with with ADP IN prep xxxx True True
us we PRON PRP pobj xx True True
? ? PUNCT . punct ? False False

However, this similar sentence treats May differently:

doc = nlp("May May come over for dinner?")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

May may AUX NNP aux Xxx True True
May may AUX MD aux Xxx True True
come come VERB VB ROOT xxxx True False
over over ADP RP prt xxxx True True
for for ADP IN prep xxx True True
dinner dinner NOUN NN pobj xxxx True False
? ? PUNCT . punct ? False False

and of course, there are a number of women's names that could give a parser fits (Tom Swifty):

  • Charity
  • Chasity
  • Faith - fails: "Faith has faith."
  • Grace - this works: "Grace has grace."
  • Hazel - this works: "Hazel has hazel eyes."
  • Hope - this works: "Hope has no hope."
  • ivy - this works: "Ivy climbed the ivy."
  • Prudence
  • Scarlet - this works "Scarlet turned scarlet."

and some men's names:

  • Bill - works: "Bill bills for work he did."
  • Bob - fails: "Bob bobs for apples"
  • Grant - fails: "Grant grants grants." but this works: "Ulysses Grant grants grants."
  • Jack - works "Jack has a jack."
  • John - works: "John went to the john."
  • Miles - fails "Miles went for miles." but this works: "Miles Standish went for miles."
  • River - fails: "River drank from the river." but this works: "River Phoenix drank from the river."

gitclem avatar Feb 12 '21 02:02 gitclem

While upgrading from spaCy 2.3.1 to 3.0.0, I've noticed that some person entities are no longer detected:

import en_core_web_lg
nlp = en_core_web_lg.load()
doc = nlp('I am not Mr Foo.')
print(doc.ents) 
# (Mr Foo,) with 2.3.1
# () with 3.0.0

paulbriton avatar Feb 12 '21 18:02 paulbriton

The multilingual model xx_sent_ud_sm does not tokenize Chinese sentences correctly, while the Chinese model zh_core_web_sm does. For example:

import spacy

nlp_ml = spacy.load("xx_sent_ud_sm")
nlp_ml.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究', '。']

nlp_zh= spacy.load("zh_core_web_sm")
nlp_zh.tokenizer("包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。")
# ['包括', '联合国', '机构', '和', '机制', '提出', '的', '有关', '建议', '以及', '现有', '的', '外部', '资料', '对', '有关', '国家', '进行', '筹备性', '研究', '。']

SpaCy version is 3.0.0

Riccorl avatar Mar 04 '21 10:03 Riccorl

@Riccorl : This is the expected behavior for the base xx tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a mistake to include the Chinese or Japanese training corpora in the xx_sent_ud_sm 3.0.0 model. They'll be omitted in the next release.

The zh_core_web_sm model uses a completely separate tokenizer based on pkuseg to do word segmentation.
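
A one-liner makes the underlying limitation concrete: a whitespace-based tokenizer has nothing to split on in Chinese text, so the whole clause comes back as a single token.

```python
# Whitespace splitting on Chinese text: no spaces, so no segmentation.
text = "包括联合国机构和机制提出的有关建议以及现有的外部资料对有关国家进行筹备性研究。"
tokens = text.split(" ")
print(len(tokens))  # 1 — the entire clause is one "token"
```

(The xx model's rule-based tokenizer additionally splits off punctuation, which is presumably why 。 came off as its own token in the output above.)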

adrianeboyd avatar Mar 04 '21 10:03 adrianeboyd

@Riccorl : This is the expected behavior for the base xx tokenizer used in that model, which just doesn't work for languages without whitespace between tokens. It was a mistake to include the Chinese or Japanese training corpora in the xx_sent_ud_sm 3.0.0 model. They'll be omitted in the next release.

The zh_core_web_sm model uses a completely separate tokenizer based on pkuseg to do word segmentation.

Clear. Thank you for the explanation.

Riccorl avatar Mar 04 '21 10:03 Riccorl

There is a consistent issue with two-word adjectives in English. Hyphenated two-word adjectives starting with a preposition are tokenized apart, and the POS tagging model is unable to recognize them as adjectives. This causes the model to fail when extracting noun chunks (see screenshot). To my understanding, on-board should be identified as ADJ and its dependency to charger should be amod, which would not break the noun chunk dependency tree.

Another example: (screenshot)

Two-word adjectives not starting with a preposition are properly detected: (screenshots)

I tried the following pipelines and all have the same issue:

  • en_core_web_sm
  • en_core_web_lg
  • en_core_web_trf

ezorita avatar Apr 19 '21 08:04 ezorita

@ezorita Had a quick look at the training data (OntoNotes) and it looks like "for-profit" is consistently annotated as a prepositional phrase with three tokens, while "non-profit" is a single token adjective. So it looks like this may just be a quirk of our training data.

polm avatar Apr 19 '21 09:04 polm

Hi, I hope this is the right place to ask about bad parses in spaCy v3.0.6. The first two parses I tried were both disturbingly incorrect.

"John eats cake": "cake" is parsed as a verb. No verb can appear in that position in this sentence, and cake is rarely used as a verb at all (other than caking of sand, etc.). This affects both en_core_web_sm and en_core_web_md, but not en_core_web_lg. (screenshot)

"John eats salad": "salad" is parsed as a verb, and "eats" is parsed as an auxiliary (!!!). This affects only en_core_web_sm, not en_core_web_md or en_core_web_lg. Similar to the comment above (https://github.com/explosion/spaCy/issues/3052#issuecomment-777275069), auxiliaries are a closed class in English and should really never apply to the verb eats. (screenshot)

hanryhu avatar Jun 25 '21 19:06 hanryhu

@hanryhu 😨

Thanks for the example, that definitely looks wrong. I wonder what's going on there, hm. I doubt salad is even an unseen word!

honnibal avatar Jun 26 '21 06:06 honnibal

Hi, I got another example of a word that is incorrectly labelled into a closed class: in particular, the word near seems to always be parsed as SCONJ. Why might misparses like this be introduced in spaCy 3?

(screenshot)

hanryhu avatar Jul 01 '21 22:07 hanryhu

There are some tokenization inconsistencies with French models with some common sentence structures, such as questions (inversion VERB then nsubj)

For instance, the following grammatical sentence in French "La librairie est-elle ouverte ?" (is the bookshop open ?) is tokenized as :

'La' 'librairie' 'est-elle' 'ouverte' '?'
'DET' 'NOUN' 'PRON' 'ADJ' 'PUNCT'

when it really should be:

'La' 'librairie' 'est' '-elle' 'ouverte' '?'
'DET' 'NOUN' 'VERB' 'PRON' 'ADJ' 'PUNCT'

Sometimes, there is no problem, as in "La libraire veut-elle des bonbons ?" (Does the bookseller want candy?), which gives, as expected:

'La' 'libraire' 'veut' '-elle' 'des' 'bonbons' '?'
'DET' 'NOUN' 'VERB' 'PRON' 'DET' 'NOUN' 'PUNCT'

(Model used for generating those examples is fr_dep_news_trf )
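
If pre-splitting the clitic before tokenization is acceptable for your pipeline, a regex workaround is possible. This is a sketch under that assumption, not a fix for the model: the pronoun list is illustrative and it doesn't handle euphonic forms like "a-t-elle".

```python
import re

# Split French subject-clitic inversion ("est-elle" -> "est -elle")
# before handing the text to the tokenizer.
CLITIC = re.compile(r"(?<=\w)(-(?:elles|elle|ils|il|on|je|tu|nous|vous|ce))\b")

print(CLITIC.sub(r" \1", "La librairie est-elle ouverte ?"))
# La librairie est -elle ouverte ?
```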

dorianve avatar Jul 28 '21 20:07 dorianve

For this sentence, spaCy tags "down" as ADV. But it seems that "down" should be tagged as ADP, since "down" is movable around the object.

(postag) lab-workstation:$ cat test_spaCy.py 
import spacy
from spacy.tokens import Doc
 
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_trf', exclude=['lemmatizer', 'ner'], )
 
sen = 'Hold the little chicken down on a flat surface .'
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(sen)
 
for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, token.head, token.head.i)

(postag) lab-workstation:$  python test_spaCy.py 
0 Hold VERB ROOT Hold 0
1 the DET det chicken 3
2 little ADJ amod chicken 3
3 chicken NOUN dobj Hold 0
4 down ADV advmod Hold 0
5 on ADP prep Hold 0
6 a DET det surface 8
7 flat ADJ amod surface 8
8 surface NOUN pobj on 5
9 . PUNCT punct Hold 0

spaCy version: 3.1.0 Platform: Linux-4.15.0-48-generic-x86_64-with-debian-buster-sid Python version: 3.7.10 Pipelines: en_core_web_trf (3.1.0), en_core_web_sm (3.1.0)

muchang avatar Sep 04 '21 02:09 muchang

I have a complaint about the Portuguese model as well. The expression um irmão meu is parsed incorrectly by pt_core_news_md; pt_core_news_sm works correctly. Here's the output of the medium model for the dependency parse — there are 2 ROOTs:

>>> [token.dep_ for token in doc]
['det', 'ROOT', 'ROOT']

Second ROOT should be det.
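
This kind of malformed output is easy to screen for in bulk, since a dependency parse of a single sentence should have exactly one ROOT. A simple sanity-check sketch over the dep labels:

```python
# A well-formed dependency parse of one sentence has exactly one ROOT.
def count_roots(deps):
    return sum(dep == "ROOT" for dep in deps)

print(count_roots(["det", "ROOT", "ROOT"]))  # 2 -> malformed parse
```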

DuyguA avatar Oct 28 '21 12:10 DuyguA

Strange behavior with alignment and transformers. In the following text, the word "dog" in the second sentence is aligned with two word pieces, but the second one is speed, the next word. Changing the text a bit makes the error disappear.

spacyModel="en_core_web_trf"
nlp = spacy.load(spacyModel)
text= "Mariano jumps over a lazy dog. The dog speed is 20 km / h. "
doc =nlp(text) 
align=doc._.trf_data.align
tokens=doc._.trf_data.tokens
for tok, parts in zip(doc,align):
  list=[x for  y in parts.data for x in y]
  print (tok.text,parts.lengths,list  ,'|'.join([tokens['input_texts'][0][part] for part in list]) )

produces this output:

Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
lazy [1] [7] Ġlazy
dog [1] [8] Ġdog
. [1] [9] .
The [1] [10] ĠThe
dog [2] [11, 12] Ġdog|Ġspeed
speed [1] [12] Ġspeed
is [1] [13] Ġis
20 [1] [14] Ġ20
km [1] [15] Ġkm
/ [1] [16] Ġ/
h. [2] [17, 18] Ġh|.

Maybe, should it be introduced as a bug?

joancf avatar Oct 29 '21 14:10 joancf

Strange behavior with alignment and transformers. In the following text , and changing it a bit, the error disappears, the word "dog" in the second sentence, is aligned with two token-parts, but the second one is speed, the next word

spacyModel="en_core_web_trf"
nlp = spacy.load(spacyModel)
text= "Mariano jumps over a lazy dog. The dog speed is 20 km / h. "
doc =nlp(text) 
align=doc._.trf_data.align
tokens=doc._.trf_data.tokens
for tok, parts in zip(doc,align):
  list=[x for  y in parts.data for x in y]
  print (tok.text,parts.lengths,list  ,'|'.join([tokens['input_texts'][0][part] for part in list]) )

produces this output Mariano [3] [1, 2, 3] M|arian|o jumps [1] [4] Ġjumps over [1] [5] Ġover a [1] [6] Ġa lazy [1] [7] Ġlazy dog [1] [8] Ġdog . [1] [9] . The [1] [10] ĠThe dog [2] [11, 12] Ġdog|Ġspeed speed [1] [12] Ġspeed is [1] [13] Ġis 20 [1] [14] Ġ20 km [1] [15] Ġkm / [1] [16] Ġ/ h. [2] [17, 18] Ġh|.

Maybe, should it be introduced as a bug?

FYI, the Ġ unicode character is:

U+0120 : LATIN CAPITAL LETTER G WITH DOT ABOVE

My guess is that the 0x20 part is for a space and (wilder guess) the 0x01 might be the length of the space.

gitclem avatar Oct 29 '21 20:10 gitclem

The Ġ I don't care about; I think it's part of the tokenizer, I did not check. What you should notice is that dog has two parts [11, 12] while speed also has [12], meaning that word part 12 is duplicated.

If you change the first sentence (just remove "lazy") then the result seems correct

text= "Mariano jumps over a dog. The dog speed is 20 km / h. "

Mariano [3] [1, 2, 3] M|arian|o
jumps [1] [4] Ġjumps
over [1] [5] Ġover
a [1] [6] Ġa
dog [1] [7] Ġdog
. [1] [8] .
The [1] [9] ĠThe
dog [1] [10] Ġdog
speed [1] [11] Ġspeed
is [1] [12] Ġis
20 [1] [13] Ġ20
km [1] [14] Ġkm
/ [1] [15] Ġ/
h. [2] [16, 17] Ġh|.

joancf avatar Oct 29 '21 21:10 joancf

@joancf : I'm pretty sure this is an unfortunate side effect of not having a proper encoding for special tokens or special characters in the transformer tokenizer output. In the tokenizer output it looks identical when there was <s> in the original input and when the tokenizer itself has inserted <s> as a special token.

To align the two tokenizations, we're using an alignment algorithm that doesn't know anything about the special use of Ġ, but it does know about unicode and that this character is related to a g, so I think it ends up aligned like this because of the final g in dog. If you replace g with a different letter, it's not aligned like this.
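The "related to a g" part is visible with plain Unicode normalization: Ġ canonically decomposes into a capital G plus a combining dot above, which is what a generic character-based aligner can latch onto. A quick check:

```python
import unicodedata

# U+0120 (Ġ) decomposes into "G" + U+0307 (combining dot above)
decomposed = unicodedata.normalize("NFKD", "\u0120")
print([hex(ord(c)) for c in decomposed])  # ['0x47', '0x307']
```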

We've also had issues with <s> being aligned with the letter s. Since the tokenizer does have settings that show what its special tokens are, we try to ignore those while aligning when possible, but the various tokenizer algorithms vary a lot, and in the general case we don't know which parts of the output are special characters.

If we want to drop support for slow tokenizers, I think we can potentially work with the alignment returned by the tokenizer, but we haven't gotten it to work consistently in the past and for now we're using this separate alignment method. I suspect this kind of alignment happens relatively rarely and doesn't affect the final annotation too much.

adrianeboyd avatar Nov 01 '21 08:11 adrianeboyd

Hi @adrianeboyd, I think I can survive with this occasional issue, but with longer texts the parts seem totally broken. It seems (please correct me if I'm wrong) that to process a long text, the nlp pipe internally splits it into blocks of ~145 original tokens (I don't know how to increase/modify that number), and the results in trf_data are also per block (the size of all blocks is the same, but it changes from document to document, matching the maximum number of word parts in any of them). But trf_data.align is a single ragged array with the same number of entries as the document has tokens.

In the next code, I tried a long text:

text="""  This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
And of course one would expect that the word sentence should have the same embedding in each sentence.
Well, maybe slightly different but not very different, so we can chen similarity.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
This is a test sentence to check how spacy processes long texts and how to extract embbedings from them.
"""
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp(text)
toks = [tok for tok in doc]
print(f"tokens length {len(toks)}")
align = doc._.trf_data.align
tokens = doc._.trf_data.tokens
trf = doc._.trf_data.tensors
print(f"number of data wordParts {len(align)} distributed in {len(tokens['input_texts'])} chunks of size {len(tokens['input_texts'][0])} and transformers shape {trf[0].shape}")
# we can flatten the inputs and tensors (as ndarrays, not tensors) to apply the alignment more easily
x, y, z = trf[0].shape
trf[0].shape = (1, -1, z)
print(trf[0].shape)
inputs = [x for ins in tokens['input_texts'] for x in ins]
print(f"size of flattened inputs and tensors: {len(inputs)}, {trf[0].shape}")
for tok, parts in zip(toks, align):
    part_ids = [x for y in parts.data for x in y]
    print(tok.text, part_ids, '|'.join(inputs[part] for part in part_ids))

produces these outputs

number of data wordParts 369 distributed in 4 chunks of size 147 and transformers shape (4, 147, 768)
(1, 588, 768)
size of flattened inputs and tensors: 588, (1, 588, 768)
   [] 
This [2] ĠThis
is [3] Ġis
a [4] Ġa
test [5] Ġtest
sentence [6] Ġsentence
to [7] Ġto
check [8] Ġcheck
how [9] Ġhow
spacy [10, 11] Ġsp|acy
processes [12] Ġprocesses
long [13, 14] Ġlong|Ġtexts
texts [14] Ġtexts
.... more stuff here...
sentence [107] Ġsentence
should [108] Ġshould
have [109, 148] Ġhave|have
the [110, 149] Ġthe|Ġthe
same [111, 150] Ġsame|Ġsame
embedding [112, 113, 114, 151, 152] Ġembed|ding|Ġin|Ġembed|ding
in [114, 153] Ġin|Ġin
each [115, 154] Ġeach|Ġeach
sentence [116, 155] Ġsentence|Ġsentence
. [117, 156] .|.
......
. [135, 174] .|.

 [] 
This [137, 176] This|This
is [138, 177] Ġis|Ġis
a [139, 178] Ġa|Ġa
test [140, 179] Ġtest|Ġtest
sentence [141, 180] Ġsentence|Ġsentence
to [142, 181] Ġto|Ġto
check [182] Ġcheck
how [183] Ġhow
spacy [184, 185] Ġsp|acy
processes [186] Ġprocesses
long [187] Ġlong
texts [188] Ġtexts

So the parts of "long" are wrong (the error mentioned before). But after position 109 ("have"), the sets of token parts contain duplications, with a distance of about 40 positions between the parts; it seems to be confusing one chunk with the next. This goes on up to position 142, where it jumps to 182 and seems to continue with correct results (a single token part), up to token 403 where the same behavior appears again.

[Edited] It is not a bug; I answer myself. The reason this happens is that batches include the last part of the previous segment to give context. So the last words of a segment appear twice and therefore have two embeddings. E.g. the word in position 109 is duplicated in the next segment (starting at 148), and this happens for the next ~40 word parts. So the embedding of "have" should be the "average" of the embeddings at [109, 148].
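That averaging can be sketched with plain NumPy (the array below is a random stand-in for the flattened tensor, not real model output):

```python
import numpy as np

# Stand-in for the flattened wordpiece activations (588 pieces x 768 dims)
rng = np.random.default_rng(0)
flat = rng.standard_normal((588, 768)).astype("float32")

# "have" was aligned to wordpieces 109 and 148 (the overlap duplicate),
# so one token vector can be pooled by averaging the aligned rows.
aligned = [109, 148]
have_vector = flat[aligned].mean(axis=0)
print(have_vector.shape)  # (768,)
```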

This means that nearly 1/4 (~40/160) of the tokens are processed twice
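The overlap (and the chunk size) comes from the strided span getter in the transformer component's config, so window and stride can be adjusted there. A sketch of the relevant block, using the example values from the spacy-transformers docs (the trained pipelines may ship different numbers):

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```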

joancf avatar Nov 04 '21 15:11 joancf

de_core_news_sm does not correctly infer sentence boundaries for gendered sentences in German. In German, a current trend is to gender plurals of a word to make clear that women are included (since the generic plural is masculine). E.g. 'Kunde' (customer) becomes 'Kund:innen' (or 'Kund*innen' or 'Kund_innen' or 'KundInnen').

The sentence 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.' gets split into two sentences at the colon, even though it is only one.

To reproduce the issue:

import spacy
nlp = spacy.load("de_core_news_sm")
sentence = 'Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.'
print(list(nlp(sentence).sents))

yields [Selbstständige mit körperlichen Kund:, innenkontakt sind ebenfalls dazu verpflichtet, sich mindestens zweimal pro Woche einem PoC - Test zu unterziehen.] (So two instead of one sentence).

de_core_news_md handles this correctly.
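A possible stopgap (a sketch, not an official spaCy recipe; the component name and the "innen" heuristic are made up) is to veto sentence starts right after a gendered colon. Shown here on a blank pipeline with the rule-based sentencizer:

```python
import spacy
from spacy.language import Language

@Language.component("no_gender_colon_split")
def no_gender_colon_split(doc):
    # Veto sentence starts created by splitting on a gendered colon,
    # e.g. "Kund" + ":" + "innenkontakt" (heuristic: next part starts with "innen").
    for i in range(2, len(doc)):
        if (
            doc[i - 1].text == ":"
            and doc[i - 2].is_alpha
            and doc[i].text.startswith("innen")
        ):
            doc[i - 1].is_sent_start = False
            doc[i].is_sent_start = False
    return doc

nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")
nlp.add_pipe("no_gender_colon_split")
doc = nlp("Selbstständige mit körperlichen Kund:innenkontakt sind ebenfalls "
          "dazu verpflichtet, sich mindestens zweimal pro Woche einem "
          "PoC - Test zu unterziehen.")
print(len(list(doc.sents)))  # 1
```

With a trained pipeline you would add the component before the parser, since preset is_sent_start values constrain its sentence segmentation.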

d-e-h-i-o avatar Nov 06 '21 13:11 d-e-h-i-o

en_core_web_sm-3.0.0 seems to have trouble detecting organization entities if the brand name contains a possessive pronoun ("my"), even when it is directly called a company for context. Sometimes the model thinks it is a CARDINAL, other times MONEY.

Screen Shot 2021-11-16 at 4 48 06 PM

Or in the demo:

Screen Shot 2021-11-16 at 4 49 19 PM

Passing it any sentence with brand names that contain this kind of language appears to introduce a lot of consistency issues.

WalrusSoup avatar Nov 17 '21 00:11 WalrusSoup

zh_core_web_trf is not detecting sentence boundaries correctly in Chinese.

nlp = spacy.load("zh_core_web_trf")
doc = nlp("我是你的朋友。你是我的朋友吗?我不喜欢喝咖啡。")

This should be three separate sentences, but the sents property only contains one sentence.
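As a stopgap (a sketch, not a fix for the trf model), the rule-based sentencizer handles this text out of the box, since its default punctuation set includes the fullwidth stops:

```python
import spacy

# Blank Chinese pipeline (character segmentation, no trained model needed)
nlp = spacy.blank("zh")
nlp.add_pipe("sentencizer")
doc = nlp("我是你的朋友。你是我的朋友吗?我不喜欢喝咖啡。")
print(len(list(doc.sents)))  # 3
```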

peterolson avatar Nov 19 '21 19:11 peterolson

Default stop words for Norwegian bokmål (nb) in spaCy contain important entities, e.g. France, Germany, Russia, Sweden and USA, as well as 'police district', important units of time (months, days of the week), and organisations.

Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

The error candidates need to be removed from the default list of stop words, please see attached spreadsheet, which contains both Norwegian bokmål, English, if it is an error candidate, and a short comment about why.

While infecting the default list of stop words could be considered an attack vector, a way of "poisoning the well", this is probably due to a local stop word list having been committed to the central repository at some time by someone.

Below are the steps needed to reproduce the list of stop words in Norwegian bokmål.

Stop words in Norwegian Bokmål

# Import spaCy
import spacy

# Import the stop words for Norwegian bokmål
from spacy.lang.nb.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS

# Print the total number of stop words:
print('Default number of stop words in Norwegian bokmål in Spacy: %d' % len(spacy_stopwords))

# Print the stop words:
print('Default stop words in Norwegian bokmål in Spacy: %s' % list(spacy_stopwords)[:249])

Default stop words in Norwegian bokmål in Spacy: ['har', 'fjor', 'dem', 'får', 'oss', 'det', 'gikk', 'svært', 'tillegg', 'fem', 'fram', 'noe', 'ifølge', 'kontakt', 'og', 'få', 'ut', 'blant', 'fikk', 'være', 'mellom', 'videre', 'tyskland', 'der', 'tid', 'mot', 'bak', 'mål', 'ikke', 'laget', 'saken', 'landet', 'utenfor', 'bris', 'hennes', 'kom', 'seks', 'ha', 'hva', 'leder', 'å', 'denne', 'gjør', 'regjeringen', 'del', 'sted', 'man', 'funnet', 'prosent', 'bare', 'satt', 'gå', 'menn', 'tirsdag', 'nok', 'vært', 'her', 'en', 'ser', 'fredag', 'veldig', 'at', 'også', 'komme', 'først', 'kort', 'annen', 'gjennom', 'nye', 'når', 'kunne', 'annet', 'oslo', 'igjen', 'skulle', 'frankrike', 'i', 'et', 'klart', 'land', 'henne', 'meg', 'kveld', 'uten', 'president', 'drept', 'fire', 'kroner', 'under', 'fotball', 'fortsatt', 'ta', 'gjort', 'var', 'blir', 'politiet', 'av', 'fra', 'etter', 'sett', 'eller', 'bedre', 'inn', 'mens', 'andre', 'ny', 'på', 'til', 'ligger', 'helt', 'personer', 'ingen', 'ved', 'god', 'ville', 'and', 'vant', 'kvinner', 'som', 'politidistrikt', 'tror', 'slik', 'tre', 'tatt', 'løpet', 'store', 'viktig', 'kl', 'siste', 'måtte', 'like', 'for', 'flere', 'lørdag', 'millioner', 'allerede', 'usa', 'mars', 'seg', 'mannen', 'samme', 'sier', 'stor', 'mandag', 'jeg', 'noen', 'mange', 'mennesker', 'hvorfor', 'vi', 'ja', 'ntb', 'år', 'dette', 'beste', 'neste', 'står', 'litt', 'kampen', 'by', 'nå', 'sa', 'selv', 'vil', 'mye', 'gang', 'opp', 'bli', 'ble', 'er', 'godt', 'siden', 'russland', 'de', 'la', 'ett', 'stedet', 'før', 'norske', 'om', 'opplyser', 'ham', 'ned', 'kommer', 'rundt', 'tilbake', 'du', 'hans', 'kamp', 'minutter', 'gjøre', 'gjorde', 'september', 'den', 'sitt', 'sammen', 'hvor', 'to', 'så', 'han', 'sin', 'samtidig', 'viser', 'da', 'dag', 'grunn', 'alle', 'norge', 'msci', 'fått', 'hele', 'går', 'men', 'mener', 'norsk', 'se', 'ønsker', 'gi', 'hun', 'disse', 'hadde', 'plass', 'både', 'alt', 'torsdag', 'første', 'skal', 'må', 'søndag', 'kan', 'vår', 'senere', 'langt', 
'tok', 'folk', 'dermed', 'med', 'mer', 'sverige', 'blitt', 'poeng', 'enn', 'over', 'runde', 'sine', 'tidligere', 'skriver', 'onsdag', 'hvordan'] ` 2021-12-06 NLP Spacy - stop words in Norwegian bokmål model - error candidates.xlsx

EDIT: Wow, these stop word errors have been in the Norwegian bokmål file since 2017! o_O See https://github.com/explosion/spaCy/blob/f46ffe3e893452bf0c171c6c7fcf3b0e458c8f9e/spacy/lang/nb/stop_words.py
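Until the list is fixed upstream, entries can be overridden locally; a sketch (the word selection is just a few of the error candidates, and note that mutating Defaults.stop_words affects all Norwegian pipelines in the process):

```python
import spacy

nlp = spacy.blank("nb")
# Locally drop some of the reported error candidates
for word in ("tyskland", "frankrike", "russland", "sverige", "usa"):
    nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = False

print(nlp.vocab["usa"].is_stop)  # False
```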

HaakonME avatar Dec 06 '21 13:12 HaakonME

Hi @HaakonME !

There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data.

I do want to point out that we don't typically recommend filtering out stop words, as with today's modern neural network approaches this is rarely needed or even useful. That said, some users do rely on them for various preprocessing needs, and I definitely agree with you that they should not contain meaningful words.

Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations.

If you would feel up to the challenge, we'd appreciate a PR to address some of the most obvious mistakes in the stop word lists. Ideally, that PR should be based off of our develop branch, because we consider changing the current stop words as slightly breaking, and would keep the change for 3.3 (in contrast, the current master branch will power the next 3.2.1 release).

If you need help for creating the PR, I could recommend reading the section over at https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md#getting-started and we can try to guide you as well :-)

svlandeg avatar Dec 06 '21 16:12 svlandeg

Hi @svlandeg !

I have proposed a change to remove NER words from Norwegian stop words in the develop branch as suggested. :-)

Cheers, Haakon

HaakonME avatar Dec 07 '21 06:12 HaakonME

@peterolson Sorry for the late reply, but thanks for reporting this. It does seem that the zh trf model really avoids recognizing short sentences.

We took a look at our training data (OntoNotes) and didn't find anything obviously wrong, but we'll keep looking at it.

polm avatar Dec 07 '21 07:12 polm

Spanish tokenization is broken when there is no space between question sentences "?¿"

nlp = spacy.load("es_dep_news_trf")
doc = nlp("¿Qué quieres?¿Por qué estás aquí?")

quieres?¿Por is treated as one token, but there should be a sentence boundary between "?" and "¿", and "quieres" and "Por" should be separate tokens.
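A tokenizer-level workaround can be sketched by also treating the (inverted) question and exclamation marks as infixes; this is a local customization, not an official fix:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("es")
# Add ?, !, ¿, ¡ as infixes so "quieres?¿Por" splits into separate tokens
infixes = list(nlp.Defaults.infixes) + [r"[?!¿¡]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("¿Qué quieres?¿Por qué estás aquí?")
print([t.text for t in doc])
```

Sentence boundaries would still need a sentencizer or parser on top; this only fixes the token split.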

peterolson avatar Dec 07 '21 16:12 peterolson

NER recognizing 's as an entity in the en_core_web_sm and en_core_web_lg models. Example below:

import spacy
content = """3 WallStreetBets Stocks to Trade | Markets Insider InvestorPlace - Stock Market News, Stock Advice & Trading Tips
What’s the next big thing on Wall Street? These days it might just be what’s trending or more specifically, receiving big-time mentions on WallStreetBets. Or not. The name in question might already be a titan of commerce.
Today let’s take a look at three such WallStreetBets stocks’ price charts and determine what’s technically hot and what’s not for your portfolio.
Reddit’s r/WallStreetBets. The seat-of-your-pants trading forum has made quite the name for itself in 2021. But you don’t need me to tell you that, right? Right."""

nlp_sm = spacy.load("en_core_web_sm")
nlp_md = spacy.load("en_core_web_md")
nlp_lg = spacy.load("en_core_web_lg")

nlp_sm(content).ents
Out[16]: (3, Stock Advice & Trading Tips, Today, ’s, three, 2021)

nlp_md(content).ents
Out[17]: (3, Stock Advice & Trading Tips, Today, three, 2021)

nlp_lg(content).ents
Out[18]: (3, These days, Today, ’s, three, Reddit, 2021)

Version Info:

pip list | grep spacy
spacy                             3.0.6
spacy-alignments                  0.8.3
spacy-legacy                      3.0.8
spacy-stanza                      1.0.0
spacy-transformers                1.0.2

narayanacharya6 avatar Dec 27 '21 16:12 narayanacharya6

@narayanacharya6 Cannot reproduce with 3.2. Can you upgrade and try again? Also include your model versions (spacy info).

Note that ’s and 's are not the same, and the non-ASCII version is probably not in our training data. I suspect we fixed this with character augmentation at some point.

polm avatar Dec 28 '21 04:12 polm

Outputs in previous comment were based on model version 3.0.0. Tried version 3.2.0 - and ’s is no longer identified as entity. Thanks!

narayanacharya6 avatar Dec 28 '21 15:12 narayanacharya6

For the German sentence "Die Ärmel der Strickjacke haben am Armabschluss ein Bündchen." in v3.2.1 "Die Ärmel" is parsed as Fem Singular instead of Masc Plural; in v3.1.4 the determiner "Die" was correctly parsed as Masc Plural ("Case=Nom|Definite=Def|Gender=Masc|Number=Plur|PronType=Art").

For the English sentence "Kennedy got killed.", "got" is lemmatized to "got" instead of "get".

cyriaka90 avatar Jan 31 '22 18:01 cyriaka90

Sorry for posting an unrelated point here, but I could not figure out a better place. Is there a reference to the model architecture / training code for the public models published by spaCy (e.g. 'en_core_web_md')? I looked at the spaCy model repo, but that has model files and meta information, not the actual training code.

saurav-chakravorty avatar Feb 24 '22 04:02 saurav-chakravorty

@saurav-chakravorty If you have a question it's better to open a new Discussion than to post in an unrelated thread.

The training code is not public, partly because the training data requires a license (like OntoNotes for English), partly because a lot of it is infra-related and not of public interest.

polm avatar Feb 24 '22 06:02 polm

Stop words in Spanish contain many significant words

The list of stop words in Spanish contains many not very frequent verb forms and unusual words. Compared to the English list, there are many more words and many of them seem meaningful. It's a very strange selection.

  • many verb forms meaning 'say' and similar — shouldn't they at least be lemmatized?: afirmó, agregó, añadió, indicó, informó... (said, added, reported...)

Meaningful not very frequent words:

  • antaño, empleo, ejemplo, lugar, país, días, raras... (in olden days, employment, example, place, country, days, strange — only the feminine plural, not any other form of that adjective)

It even contains misspelled words (and the kind of misspellings that are not frequent):

  • ampleamos from empleamos (we employ)
  • arribaabajo from arriba plus abajo (up down)
  • gueno from bueno (this might be more frequent, but mostly as a joke; not a stop word to me)

Update: I also noticed that there aren't any one-letter stop words, while in English, 'a' and 'i' are included in the list. In Spanish, these letters could be considered stop words:

  • a ('to') e (variant of 'and') o ('or') u (variant of 'or') y ('and')

https://github.com/explosion/spaCy/blob/master/spacy/lang/es/stop_words.py

mgrojo avatar Apr 16 '22 21:04 mgrojo

@mgrojo Thanks for pointing that out! If you'd like to open a PR we'd be happy to review it.

polm avatar Apr 18 '22 04:04 polm

@polm Thanks. I've already made that pull request.

mgrojo avatar Apr 18 '22 22:04 mgrojo

Some weirdness in de_core_news_md-3.3.0... I'm interested in lemmas, and I found that the lemma of Hässliche varies depending on the context:

>>> nlp = spacy.load('de_core_news_md')
>>> [(x.lemma_, x.pos_) for x in nlp('Die neuste philosofische Prägung wird Hässliche genannt.')]
[('der', 'DET'), ('neuste', 'ADJ'), ('philosofisch', 'ADJ'), ('Prägung', 'NOUN'), ('werden', 'AUX'), ('hässliche', 'NOUN'), ('nennen', 'VERB'), ('--', 'PUNCT')]
>>> [(x.lemma_, x.pos_) for x in nlp('die Hässliche')]
[('der', 'DET'), ('Hässliche', 'NOUN')]

dblandan avatar May 04 '22 18:05 dblandan

@dblandan The v3.3 German models switched from a lookup lemmatizer that only used the word form (no context) to a statistical lemmatizer where the output does depend on the context.

adrianeboyd avatar May 05 '22 06:05 adrianeboyd

@dblandan The v3.3 German models switched from a lookup lemmatizer that only used the word form (no context) to a statistical lemmatizer where the output does depend on the context.

So there are different lexical entries for hässliche (NOUN) and Hässliche (NOUN), and one of them is capitalized while the other isn't. I'm ok with there being different entries, but I don't understand why one isn't capitalized given that it's still a noun. :thinking:

The adjectival form lemmatizes correctly to hässlich.

For reference:

hässlich  ADJ  17149702774860831989
hässliche NOUN 5552098829343672028
Hässliche NOUN 17159517463969337747

dblandan avatar May 05 '22 08:05 dblandan

The difference is that it's not looking up word forms in a table anymore, so it's not just based on an entry related to the POS or the word form. The lemmatizer is a statistical model like the tagger that uses the context to predict the lemmas based on the training data. For more details about how it works: https://explosion.ai/blog/edit-tree-lemmatizer
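For reference, the component behind this is registered as trainable_lemmatizer; a sketch of its config block with the documented defaults:

```ini
[components.lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
top_k = 1
```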

adrianeboyd avatar May 05 '22 09:05 adrianeboyd

I see. I knew that the edit-tree lemmatizer was coming; I'm still surprised about this particular output. I'll just handle it in post-processing. Thanks for the reply! :smile:

dblandan avatar May 05 '22 09:05 dblandan

👋🏻 🤗

Let me know if there's a better place for this. I came across odd behavior from the English lemmatizer that seemed worth reporting.

  • Operating System: macOS, Monterey 12.2
  • Python Version Used: 3.8
  • spaCy Version Used: 3.2.4
  • Environment Information:

Here are minimal reproduction steps showing that in certain contexts the lemmatizer predicts/maps "guys" -> "you":

>>> import spacy
>>> spacy.__version__
'3.2.4'
>>> nlp = spacy.load("en_core_web_md")
>>> nlp("The guys all")[1].lemma_
'you'

mathcass avatar May 11 '22 22:05 mathcass