SOLR-17346: Synchronise stopwords from snowball with those in lucene

Open alastair opened this issue 1 year ago • 4 comments

https://issues.apache.org/jira/browse/SOLR-17346

Description

Solr's default configset comes with a collection of sample stopwords from the snowball project. There is a similar set of stopword lists in the lucene repository, but those have been updated to a more recent version of the snowball lists. Specifically, the most recent French stopword list drops a number of words that are homonyms of other useful words that shouldn't be skipped.
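To see which words a newer list drops, the two files can be compared as sets. This is a minimal sketch, assuming the snowball file convention that `|` starts a comment and that a line may carry several whitespace-separated words. The function names are hypothetical and are not part of solr or lucene.

```python
def parse_snowball_stopwords(text):
    """Return the set of words in a snowball-format stopword list.

    Assumption: '|' starts a comment that runs to the end of the line,
    and a line may hold several whitespace-separated words.
    """
    words = set()
    for line in text.splitlines():
        # Drop the comment part, if any, then collect the remaining words.
        content = line.split("|", 1)[0]
        words.update(content.split())
    return words


def removed_words(old_text, new_text):
    """Return the sorted words present in old_text but missing from new_text."""
    old_words = parse_snowball_stopwords(old_text)
    new_words = parse_snowball_stopwords(new_text)
    return sorted(old_words - new_words)
```

Calling `removed_words` with the contents of solr's French stopwords file as `old_text` and lucene's as `new_text` would list the words this change removes from the sample configset.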

Solution

Copy the snowball stopword files from lucene to solr. I only copied files that are present in https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball, and only if solr's version of the file was also from the snowball project (e.g. the english and indonesian stopwords files in solr aren't from snowball, so I didn't copy them from lucene even though they exist there).
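The selection rule above can be sketched as a small script. This is an illustration only, not the process used for this PR: the `looks_like_snowball` heuristic (checking the file header for the word "snowball"), the `name_map` argument, and both function names are assumptions made for the example, and the directory paths are left to the caller.

```python
from pathlib import Path
import shutil


def looks_like_snowball(path):
    """Heuristic (assumption): snowball-derived files mention 'snowball'
    somewhere in their leading comment block."""
    head = Path(path).read_text(encoding="utf-8", errors="replace")[:2000]
    return "snowball" in head.lower()


def sync_stopwords(lucene_dir, solr_lang_dir, name_map, dry_run=True):
    """Copy lucene stopword files over their solr counterparts.

    name_map maps a lucene file name to the matching solr file name
    (hypothetical; the caller supplies the real names).
    A file is copied only if it exists on both sides and the solr copy
    already looks like it came from snowball.
    Returns the solr file names that were (or would be) replaced.
    """
    copied = []
    for lucene_name, solr_name in name_map.items():
        src = Path(lucene_dir) / lucene_name
        dst = Path(solr_lang_dir) / solr_name
        if not src.is_file() or not dst.is_file():
            continue  # present on only one side: leave it alone
        if not looks_like_snowball(dst):
            continue  # solr's list is not from snowball: keep it
        if not dry_run:
            shutil.copyfile(src, dst)
        copied.append(solr_name)
    return copied
```

With `dry_run=True` (the default) nothing is written, so the returned list can be reviewed before any file is replaced.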

Tests

  • build solr with ./gradlew dev
  • start solr and create a new core
  • verify that the expected files were copied to the new core
  • verify that the core starts up

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [x] I have run ./gradlew check.
  • [ ] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide

alastair avatar Jun 25 '24 11:06 alastair

This all looks very straightforward to me... One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective. I don't know how we have handled other data sets like this. I could see this being something that only goes on 10.x....

epugh avatar Jun 26 '24 21:06 epugh

> One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective.

Nothing in this PR changes the "way" we apply stopwords; it only changes the list of stopwords in the _default configset, which we are free to do in any release (even a bugfix release if we feel it's warranted). People who expect backcompat when upgrading should not be overwriting their configsets on upgrade.

hossman avatar Jun 27 '24 22:06 hossman

@alastair how would you like to be credited in CHANGES.txt?

epugh avatar Jun 28 '24 10:06 epugh

thanks @epugh. I'm happy to be credited as I am on my commit - "Alastair Porter"

alastair avatar Jun 28 '24 13:06 alastair