Spacy NER to Enhance Authors
Issue by jessefogarty
Fri Mar 1 21:35:54 2019
Originally opened as https://github.com/codelucas/newspaper/issues/683
Hey,
I've been using newspaper to parse articles from feeds. Right now I'm taking the author list from newspaper and using spaCy's NER to filter publication names out of it, as I sometimes end up with The Associated Press and Bloomberg News as authors :(
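Roughly, the filtering step could look like the sketch below. It is an illustration rather than the exact code: `filter_authors` is a hypothetical helper name, and it assumes that publication names come back from the model tagged with the `ORG` entity label. NER on very short strings can be unreliable, so results will vary by model.

```python
def filter_authors(authors, nlp):
    """Return the author names that the NER pipeline does not tag as ORG.

    `nlp` is any callable that returns a doc with an `.ents` sequence,
    for example a pipeline returned by `spacy.load(...)`.
    """
    kept = []
    for name in authors:
        doc = nlp(name)
        # Assumption: publications are labeled ORG by the model.
        if any(ent.label_ == "ORG" for ent in doc.ents):
            continue
        kept.append(name)
    return kept


# Usage with spaCy (requires the model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   authors = filter_authors(article.authors, nlp)
```

Passing `nlp` in as an argument keeps the helper independent of which model is loaded.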
I was wondering if this is of interest to anyone. I've been debating integrating it into the library and submitting a PR, but I don't want to spend the time doing it if no one besides myself will have a use for it.
I'd add a flag to set the model, since getting the most accurate results (using en_core_web_lg) adds roughly 14 seconds to the script to load the model. Some people may find en_core_web_sm or _md sufficient for their needs and/or want to reduce the script load time.
The loaded model is cached though, so it's just a one-time hit.
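One way the model flag and the caching could fit together is sketched below. `get_nlp` is a hypothetical name and the `en_core_web_sm` default is an assumption for illustration, not something newspaper provides. The only spaCy call used is `spacy.load`.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_nlp(model_name="en_core_web_sm"):
    """Load a spaCy pipeline once per model name and reuse it afterwards."""
    # Imported lazily so spaCy is only needed when NER is actually requested.
    import spacy

    return spacy.load(model_name)
```

The first `get_nlp("en_core_web_lg")` call pays the load time. Later calls with the same name return the cached pipeline.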
Thanks,
Jesse
Comment by aussetg
Wed Mar 6 10:29:38 2019
If you're introducing spaCy as a dependency, then you might as well replace NLTK with spaCy too.
I think I'm going to do it.