SOLR-17346: Synchronise stopwords from snowball with those in lucene
https://issues.apache.org/jira/browse/SOLR-17346
Description
Solr's default configset comes with a collection of sample stopwords from the snowball project. There is a similar set of stopword lists in the lucene repository, but those have been updated to a more recent snowball version. Specifically, the most recent French stopword list removes a number of words that are homonyms of other useful words which shouldn't be skipped.
Solution
Copy the snowball-derived stopword files from lucene to solr. I only copied files that were present in https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball and only if the version of the file in solr was also from the snowball project (e.g. the english and indonesian stopwords files in solr aren't from snowball, so I didn't copy them from lucene even though they exist there).
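The selection rule above can be sketched as a small script. This is only an illustration, not the process actually used for this PR: the function name `sync_stopwords`, the `name_map` argument (a caller-supplied mapping from Lucene file names to Solr file names) and the header check (looking for the word "snowball" in the first few lines of the Solr file) are all assumptions made for the example.

```python
from pathlib import Path
import shutil


def sync_stopwords(lucene_dir, solr_lang_dir, name_map, marker="snowball"):
    """Copy Lucene stopword files over Solr's copies (hypothetical helper).

    name_map: {lucene_file_name: solr_file_name}, supplied by the caller.
    A file is copied only if it exists on both sides and the Solr copy
    mentions `marker` in its first few lines (an assumed heuristic for
    "this file came from snowball").
    """
    copied = []
    for lucene_name, solr_name in sorted(name_map.items()):
        src = Path(lucene_dir) / lucene_name
        dst = Path(solr_lang_dir) / solr_name
        # Skip anything that is missing on either side.
        if not src.is_file() or not dst.is_file():
            continue
        # Only overwrite files whose header suggests a snowball origin.
        text = dst.read_text(encoding="utf-8", errors="replace")
        head = "\n".join(text.splitlines()[:5]).lower()
        if marker not in head:
            continue
        shutil.copyfile(src, dst)
        copied.append(solr_name)
    return copied
```

Files whose Solr copy does not look snowball-derived (such as the english and indonesian lists mentioned above) are left untouched even when a Lucene counterpart exists.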
Tests
build solr with ./gradlew dev
start solr and create a new core
verify that the expected files were copied to the new core
verify that the core starts up
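The "expected files were copied" check can also be done mechanically. A minimal sketch, assuming the stopword files are named `stopwords_*.txt` and that both the configset's and the new core's `lang` directories are reachable on disk; the function name `compare_lang_dirs` and the directory arguments are hypothetical and would need to be pointed at a real installation:

```python
import filecmp
from pathlib import Path


def compare_lang_dirs(configset_lang_dir, core_lang_dir):
    """Compare stopword files in a configset with those in a core (sketch).

    Returns a dict listing which files match byte-for-byte, which differ,
    and which could not be compared (e.g. missing from the core).
    """
    expected = sorted(
        p.name for p in Path(configset_lang_dir).glob("stopwords_*.txt")
    )
    # shallow=False forces a content comparison instead of trusting os.stat().
    match, mismatch, errors = filecmp.cmpfiles(
        str(configset_lang_dir), str(core_lang_dir), expected, shallow=False
    )
    return {
        "match": sorted(match),
        "mismatch": sorted(mismatch),
        "missing": sorted(errors),
    }
```

An empty `mismatch` and `missing` list would indicate that the new core received the same stopword files as the configset it was created from.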
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the Reference Guide
This all looks very straightforward to me... One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective. I don't know how we have handled other data sets like this. I could see this being something that only goes on 10.x....
> One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective.
Nothing in this PR changes the "way" we apply stopwords, it only changes the list of stopwords in the _default configset which we are free to do in any release (even a bugfix release if we feel it's warranted). People who expect backcompat when upgrading should not be overwriting their configsets on upgrade.
@alastair how would you like to be credited in CHANGES.txt?
thanks @epugh. I'm happy to be credited as I am on my commit - "Alastair Porter"