
Equivalent to nltk.corpus stopwords

Open · Utopiah opened this issue 3 years ago • 1 comment

Hi, I'm just learning about the project and it's pretty amazing. I tinkered with NLTK and Gensim before, but this is so convenient to explore and embed on a page. Learning with Observable notebooks is also great!

That being said, I end up with a lot of noise in my selection. I tried a bit of normalize() and remove() with encouraging results. Still, I'm quite surprised that when I search this repository I don't seem to find any stop words.

This made me wonder, is this the "wrong" way in this context? Is the philosophy of compromise not to rely on such lists?

PS: I apologize for hijacking issues, but is there a forum/chat/platform for discussions on using compromise that would be a better place? I have other questions, like using .tfidf() on .ngrams(), but I don't want to create noise here.

Utopiah, Aug 12 '22 14:08

Hey Fabien, you're talking about the results of the Wikipedia plugin, right?

Yeah, super noisy. It really needs a lot of work. I was using a stop-list here, but that was just me eyeballing it. It could really use a PR, if you want to take a swing at it.

To do it properly, we should also add (some!) Wikipedia redirects. I held off because the results were still so rowdy. Cheers

spencermountain, Aug 12 '22 18:08
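
For readers landing here with the same question, this is a minimal sketch of the eyeballed stop-list idea mentioned in the thread, written in plain JavaScript. It does not use compromise's API or the plugin's actual list: the `STOP_WORDS` set and the `removeStopWords` helper are hypothetical names made up for illustration, and the word list is a small hand-picked sample, not NLTK's.

```javascript
// Hypothetical, hand-picked stop list for illustration only.
// It is NOT the list used by compromise, its plugins, or nltk.corpus.stopwords.
const STOP_WORDS = new Set([
  'a', 'an', 'and', 'the', 'of', 'in', 'on', 'to', 'is', 'it', 'that', 'this',
]);

// Lowercase the text, split on anything that is not an ASCII letter,
// digit, or apostrophe, then drop empty strings and stop words.
function removeStopWords(text, stopWords = STOP_WORDS) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((token) => token.length > 0 && !stopWords.has(token));
}

const tokens = removeStopWords('The philosophy of the project is to keep it small');
console.log(tokens); // [ 'philosophy', 'project', 'keep', 'small' ]
```

A `Set` keeps each lookup cheap, and passing the list as a parameter makes it easy to swap in a longer or domain-specific one. The tokenizer here is deliberately naive (ASCII only), so a real text would likely be tokenized by the NLP library first and only then filtered.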