spacy-stanza

Use sentencizer with stanfordnlp

Open thomasthiebaud opened this issue 5 years ago • 1 comments

Right now spacy-stanfordnlp takes care of the tokenization too. Would it be possible to use spaCy's sentencizer and keep stanfordnlp just for tagging and parsing?

The only approach I can think of is running two pipelines: a first one that only uses the sentencizer and a second one that uses stanfordnlp.Pipeline. That would tokenize the text twice and probably carry a performance penalty.

I've been going through the docs and the source code but can't find a proper way to do it.

thomasthiebaud • Mar 05 '20 12:03

It seems that StanfordNLP has a tokenize_pretokenized option: https://stanfordnlp.github.io/stanfordnlp/pipeline.html#running-on-pre-tokenized-text. I'll see if I can use that.
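
For illustration, here is a rough sketch of how that option might be combined with spaCy's sentencizer. It is not taken from spacy-stanfordnlp and has not been checked against the wrapper itself: it calls stanfordnlp.Pipeline directly, so the result is a stanfordnlp document rather than a spaCy Doc. The helper names (doc_to_token_lists, to_pretokenized_text, build_pipelines, tag_with_spacy_sentences) are made up for this example, and it assumes the English models have already been fetched with stanfordnlp.download("en"). The pre-tokenized input is passed as text with one sentence per line and tokens separated by spaces, which is the format the linked page describes.

```python
def doc_to_token_lists(doc):
    """Collect token texts per sentence from a spaCy Doc with sentence boundaries set."""
    sentences = []
    for sent in doc.sents:
        # Whitespace-only tokens would break the space-separated format below.
        tokens = [tok.text for tok in sent if not tok.is_space]
        if tokens:
            sentences.append(tokens)
    return sentences


def to_pretokenized_text(sentences):
    """Render token lists as one sentence per line, tokens separated by single spaces."""
    return "\n".join(" ".join(tokens) for tokens in sentences)


def build_pipelines(lang="en"):
    """Build both pipelines once (hypothetical helper; imports are deferred on purpose)."""
    import spacy
    import stanfordnlp

    splitter = spacy.blank(lang)
    # spaCy v3 syntax; on spaCy v2 use:
    #   splitter.add_pipe(splitter.create_pipe("sentencizer"))
    splitter.add_pipe("sentencizer")

    # tokenize_pretokenized=True tells the tokenize processor to trust the input's
    # whitespace and newlines instead of running its own tokenization model.
    tagger = stanfordnlp.Pipeline(
        lang=lang,
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,
    )
    return splitter, tagger


def tag_with_spacy_sentences(text, splitter, tagger):
    """Tokenize and sentence-split with spaCy, then tag and parse with stanfordnlp."""
    pretokenized = to_pretokenized_text(doc_to_token_lists(splitter(text)))
    return tagger(pretokenized)
```

Usage would look like `splitter, tagger = build_pipelines("en")` followed by `tag_with_spacy_sentences("First sentence. Second one.", splitter, tagger)`. The text is still tokenized by spaCy's rule-based tokenizer, but stanfordnlp no longer runs its own tokenizer on top, so the double tokenization mentioned above should be avoided.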

thomasthiebaud • Mar 06 '20 16:03

Just going through some older issues, and it sounds like you found a solution. But please feel free to reopen if you're still running into issues!

adrianeboyd • Oct 09 '23 14:10