Discussion - stopwords

Open leomaurodesenv opened this issue 3 years ago • 4 comments

I like texthero and want to contribute somehow. First, I want to discuss something that bothers me: stopwords.

Problem: I want to deploy a solution without the spaCy stopwords requirement and, if possible, add my own stopwords. My solution is based on Docker containers, and it is bad practice to download files every time a new container is instantiated. It causes a cold-start problem and also uses unnecessary space (because I don't use them).

In this sense,

  • Is it possible to remove the spaCy stopwords requirement?
  • How can we add general stopwords, according to our own language needs?
  • Is there a stopwords dictionary for many languages outside spaCy?
  • How can we turn off the stopwords download?

leomaurodesenv avatar Jun 30 '21 16:06 leomaurodesenv

Hi Leonardo, thank you for opening this issue. I agree with you, it's quite annoying that stopwords are downloaded even when they are not needed. This should have been fixed in #194. I will soon release a new version that includes the patch.

Regarding your other questions:

  1. Removal of spaCy stopwords requirements. I believe we can get rid of the spaCy requirement completely by saving all stopwords in a txt file (or another file format) and loading that file directly. Do you want to work on that?
  2. Multilingual support is something we have wanted to introduce for quite a long time. If you are interested in helping to develop a general solution that works for many languages, I would be more than happy to talk!
  3. Currently, Texthero fully supports only English. Adding stopwords for other languages (with spaCy, for instance) should be trivial though; this is closely related to point 1.
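
A minimal sketch of the txt-file idea from point 1, using only the Python standard library. The file name `stopwords_en.txt`, its one-word-per-line layout, and the helpers `load_stopwords` and `remove_stopwords` are assumptions made for illustration; they are not Texthero's API.

```python
import re
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, ignoring blank lines and '#' comments."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def remove_stopwords(text, stopwords):
    """Drop every token of `text` that appears in `stopwords` (case-insensitive)."""
    tokens = re.findall(r"\w+", text)
    return " ".join(t for t in tokens if t.lower() not in stopwords)


# Demo with a throwaway file; a real project would ship the file with the package.
with tempfile.TemporaryDirectory() as tmp:
    stopwords_file = Path(tmp) / "stopwords_en.txt"  # hypothetical file name
    stopwords_file.write_text("# hypothetical English list\nthe\nis\na\n", encoding="utf-8")
    # Project-specific words can be merged in with a plain set union.
    custom = load_stopwords(stopwords_file) | {"texthero"}
    cleaned = remove_stopwords("The texthero library is a tool", custom)
    print(cleaned)
```

Because the list is a plain set, users could extend it or swap in a file for another language without any download at container start.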

Hope it helps! Best,

jbesomi avatar Jul 01 '21 15:07 jbesomi

Hi Leonardo,

I just released a new version (Texthero 1.1.0); now stopwords should be downloaded lazily. Would you mind trying it and letting me know? Later on, we can discuss your other great points further!
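
For readers unfamiliar with the term, lazy loading means the expensive step runs on first use rather than at import time. The sketch below shows the general pattern with `functools.lru_cache`; the function `get_stopwords` and the counter `load_calls` are hypothetical, and this is not necessarily how Texthero 1.1.0 implements it.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive step runs (for demonstration only)


@lru_cache(maxsize=None)
def get_stopwords():
    """Build the stopword set on first use and return the cached set afterwards."""
    global load_calls
    load_calls += 1
    # Stand-in for an expensive step such as a download or a large file read.
    return frozenset({"the", "is", "a"})


# Nothing has been loaded yet at this point.
assert load_calls == 0
first = get_stopwords()   # triggers the load
second = get_stopwords()  # served from the cache
```

With this pattern, code that never asks for stopwords never pays for loading them.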

jbesomi avatar Jul 01 '21 17:07 jbesomi

Hello @jbesomi, sorry for my late answer. Sure, I'm going to try it out next week.

Yes, I would like to help, but I'm not sure how to support multilingual stopwords. Adding multilingual embeddings could improve results, but it would also slow down the code. This is tough... heheh

Removal of spaCy stopwords requirements: I'm going to take a look and post a message here.

leomaurodesenv avatar Jul 03 '21 23:07 leomaurodesenv

Thanks for the update, Leo. As you suggested, we can start by improving the stopwords (for English) and see how it goes. Multilingual support requires some thinking and refactoring; we can discuss that later, once the simpler version is implemented. Best,

jbesomi avatar Jul 08 '21 12:07 jbesomi