spaCy

Remove default stop words

Open adrianeboyd opened this issue 3 years ago • 7 comments

Description

Remove default stop words

Stop words are task-specific, and attempting to maintain "general-purpose" stop word lists for many different languages is not feasible.

None of the underlying functionality has been modified; the only change is that the default stop word lists are empty. Users can provide their own stop word lists for their own tasks.

(Admittedly it should be easier to modify the default stop words and it should be a setting that can be serialized with the model, but there are a number of complications related to the language defaults and lex_attr_getters in general.)
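As a rough illustration of what "providing your own stop words" can look like (not code from this PR), the sketch below flags a custom word list on the vocab of a blank English pipeline. It assumes spaCy 3.x is installed; `CUSTOM_STOP_WORDS` and the sample sentence are made up for the example.

```python
import spacy

# Hypothetical task-specific stop words; choose whatever suits your task.
CUSTOM_STOP_WORDS = {"widget", "gizmo"}

nlp = spacy.blank("en")

# Mark each word as a stop word on its lexeme. The flag is set per exact
# string, so "Widget" would need its own entry.
for word in CUSTOM_STOP_WORDS:
    nlp.vocab[word].is_stop = True

doc = nlp("Ship the widget today")
flags = {token.text: token.is_stop for token in doc}
print(flags["widget"], flags["Ship"])
```

Setting the flag on the vocab affects only this `nlp` object, which keeps the word list local to the task at hand.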

Types of change

Enhancement.

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

adrianeboyd avatar Aug 16 '22 13:08 adrianeboyd

When we mention this in the release notes / blog post for v4, it would be nice to explain to users how to find the previous default lists (from a 3.x branch on GitHub) so they can easily continue working with the same lists they had before.

svlandeg avatar Aug 17 '22 09:08 svlandeg

In terms of usability, I think one option would be to add a before_creation callback for setting stop words from a JSON list.
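A minimal sketch of that idea, not code from this PR: a factory that builds a callback which replaces `Defaults.stop_words` with the contents of a JSON list. The names `load_stop_words`, `make_stop_word_callback` and `DummyLanguage` are hypothetical. In spaCy 3.x such a function would typically be registered with `@spacy.registry.callbacks` and referenced from `[nlp.before_creation]` in the config; that step is left out here so the sketch runs without spaCy.

```python
import json
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Read a JSON array of strings and return it as a set."""
    words = json.loads(Path(path).read_text(encoding="utf8"))
    if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
        raise ValueError("expected a JSON list of strings")
    return set(words)


def make_stop_word_callback(path):
    """Build a callback that takes a language class and returns it with
    Defaults.stop_words replaced by the words from the JSON file."""
    def set_stop_words(lang_cls):
        lang_cls.Defaults.stop_words = load_stop_words(path)
        return lang_cls
    return set_stop_words


# Stand-in for a language class (assumption: only Defaults.stop_words matters here).
class DummyLanguage:
    class Defaults:
        stop_words = set()


with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "stop_words.json"
    json_path.write_text(json.dumps(["a", "an", "the"]), encoding="utf8")
    callback = make_stop_word_callback(json_path)
    customized = callback(DummyLanguage)

print(sorted(customized.Defaults.stop_words))  # ['a', 'an', 'the']
```

Note that the callback reads a file every time the language class is created, so the pipeline only works where that file is available.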

adrianeboyd avatar Aug 17 '22 09:08 adrianeboyd

> In terms of usability, I think one option would be to add a before_creation callback for setting stop words from a JSON list.

Ah yes, that sounds good.

svlandeg avatar Aug 17 '22 10:08 svlandeg

I looked at all the examples in the docs for setting custom stop words and I think they should all continue to be fine.

Nothing prevents you from providing custom stop words with your custom language.

adrianeboyd avatar Aug 17 '22 10:08 adrianeboyd

The before_creation callback is kind of problematic because you don't want to reference any external data. But we can think about what to do...

adrianeboyd avatar Aug 17 '22 10:08 adrianeboyd

Could we have a default "stop word list reader" that users can plug in at the before_init callback?

svlandeg avatar Aug 19 '22 09:08 svlandeg

No, because the stop words are part of the defaults that aren't serialized with the pipelines. (Because it's effectively an is_stop method rather than a plain list/set.)

I'll have to look at it a bit more, but I've already tried to serialize them in various ways and it's basically a disaster, along with various other lex_attr_getters hacks.
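The point about is_stop being a method rather than a plain list can be pictured with a simplified stand-in (an assumption for illustration, not spaCy's actual code): the stop word set is bound into a getter function when the getter is created, so what ends up stored is a callable rather than data that could simply be written to disk.

```python
import functools


def is_stop(string, stops=frozenset()):
    """Return True if the lowercased string is in the given stop word set."""
    return string.lower() in stops


# The stop word set is baked into the getter at creation time.
defaults_stop_words = {"the", "a"}
getter = functools.partial(is_stop, stops=defaults_stop_words)

print(getter("The"))     # True
print(getter("widget"))  # False

# A getter built from empty defaults never flags anything.
empty_getter = functools.partial(is_stop, stops=set())
print(empty_getter("the"))  # False
```

Under this picture, changing the stop words means rebuilding the getter, which is part of why saving them as a simple pipeline setting is awkward.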

adrianeboyd avatar Aug 19 '22 10:08 adrianeboyd

I would wait until the dust settles around proposals like #12244 before addressing conflicts in this PR.

adrianeboyd avatar Aug 02 '23 07:08 adrianeboyd

Temporarily closing this PR as we currently don't have the bandwidth to finish this.

svlandeg avatar Jan 29 '24 09:01 svlandeg