spacy-stanza
Takes too long to parse doc results
Hello, it takes too long to parse the doc object, i.e. to iterate over the sentences and the tokens in them. Is that expected?
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)
for line in lines:
    doc = nlp.pipe([line])
The above code takes a few milliseconds (apart from initialisation) to run over 500 sentences,
snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)
token_details = []
for line in lines:
    docs = nlp.pipe([line])
    for doc in docs:
        for tok in doc:
            token_details.append([tok.text, tok.lemma_, tok.pos_])
while this takes almost a minute (apart from initialisation) to run over the same 500 sentences.
P.S.: I have put nlp.pipe() inside a for loop intentionally, so that I get all the tokens for one line even when it gets segmented into several sentences.
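For reference, a minimal sketch of the same loop without nlp.pipe(), reusing snlp, nlp, and lines from above; calling the pipeline directly also returns one Doc per line:

token_details = []
for line in lines:
    doc = nlp(line)  # one Doc covering the whole line, however it is sentence-segmented
    for tok in doc:
        token_details.append([tok.text, tok.lemma_, tok.pos_])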
@Joselinejamy nlp.pipe() is a generator, so you're not actually executing the parser in the first block. I think that's why it seems faster: it's not actually doing the work. To make sure the parse is completed, you'll need something like:
snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)
word_count = 0
for doc in nlp.pipe(lines):
    word_count += len(doc)
print(word_count)
The main efficiency problem we have at the moment is that we don't have support for batching the predictions and returning a Doc object per item. We'd gladly accept a PR for this.
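In the meantime, one possible workaround is to join the inputs yourself so the underlying models see one large text instead of many small ones. This is only a sketch, under the assumption that losing the per-line grouping is acceptable; it is not an API of the wrapper:

# Hypothetical workaround: process all lines as a single text so the
# models run over one big input, then read the tokens off the combined Doc.
text = "\n\n".join(lines)
doc = nlp(text)
token_details = [[tok.text, tok.lemma_, tok.pos_] for tok in doc]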
@honnibal Thank you for the instant response. But when I ran the code below with just spaCy's model, it took relatively little time, around just 5 seconds.
import time
import spacy

start = time.time()
spacy_nlp = spacy.load('en')
token_details = []
for line in lines:
    docs = spacy_nlp.pipe([line])
    for doc in docs:
        for tok in doc:
            token_details.append([tok.text, tok.lemma_, tok.pos_])
print("Time taken : %f " % (time.time() - start))
As per the documentation:
"If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object – for example, English()"
So when the same English object is used, why does it take so much time? Or does my understanding diverge from what is intended?
Hi, I'm also seeing a drastic performance decrease when using stanza. For comparison, here's a project I'm working on, where I'm running different combinations of parsers over more than 6000 sentences. Running CoreNLP 3 + CoreNLP 4 + spaCy takes roughly an eighth of the time of running CoreNLP 3 + CoreNLP 4 + Stanza through spacy_stanza.
Could this be GPU-related as well? These tests were run on a CPU, not a GPU.
The stanza models are just much slower than the typical spacy core models. spacy-stanza is just a wrapper that hooks stanza into the tokenizer part of the spacy pipeline, so it looks like the pipeline components are the same as in a plain English() model, but underneath the tokenizers are different. You can see:
import spacy
import stanza
import spacy_stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="en")
nlp_stanza = StanzaLanguage(snlp)
nlp_spacy = spacy.blank("en")  # equivalent to English()

# both are the same type of Language pipeline
assert isinstance(nlp_stanza, spacy.language.Language)
assert isinstance(nlp_spacy, spacy.language.Language)

# both are [] (no components beyond a tokenizer)
assert nlp_stanza.pipe_names == nlp_spacy.pipe_names

# however the tokenizers are completely different, and the
# spacy_stanza "tokenizer" is doing all the time-consuming stanza processing
assert isinstance(nlp_stanza.tokenizer, spacy_stanza.language.Tokenizer)
assert isinstance(nlp_spacy.tokenizer, spacy.tokenizer.Tokenizer)
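To make that concrete, here's a small sketch continuing from the snippet above (the sentence is just an example): even with an empty pipe_names, the Doc that comes out of nlp_stanza already carries full annotations, because the "tokenizer" ran the whole stanza pipeline.

doc = nlp_stanza("The quick brown fox jumps over the lazy dog.")
print([(tok.text, tok.pos_, tok.lemma_) for tok in doc])
# the blank spacy pipeline only tokenizes, so its Docs have no POS tags or lemmas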
And as Matt said above, there's no good batching solution for stanza at the moment, so the speed difference between nlp_spacy.pipe() and the spacy-stanza pipeline is going to be even higher.
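If you want to measure it on your own data, a minimal timing sketch along these lines should do (lines is assumed to be your list of sentence strings, and the helper name is made up):

import time

def time_pipe(nlp, texts):
    # consume the generator so the work is actually done
    start = time.time()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts))
    return time.time() - start, n_tokens

for name, pipeline in [("spacy", nlp_spacy), ("spacy-stanza", nlp_stanza)]:
    elapsed, n_tokens = time_pipe(pipeline, lines)
    print("%s: %.2fs for %d tokens" % (name, elapsed, n_tokens))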