
Replace porter_stemmer with snowball_stemmer

Open seafoamteal opened this issue 1 month ago • 2 comments

The Snowball (Porter2) stemming algorithm produces better stems than the original Porter stemmer for a significant subset of words: its results are more in line with what one would expect.

I checked that all the unit tests pass after the change.

seafoamteal, Nov 12 '25 11:11

I suppose I could find the words for which the Porter2 stemmer differs from the original, then find packages on the Gleam registry that use those words in their descriptions, and show that the older stemmer wouldn't bring up those packages for a reasonable search term. This whole change was sparked by "repeatedly" being stemmed to "repeatedli", which meant that searching for "repeat" wouldn't return a package with "repeatedly" in its description. Would more examples of the same kind help here?

Or is it enough to just show that common words are stemmed differently between the algorithms?

seafoamteal, Nov 18 '25 14:11

Maybe a couple of tests showing some newly improved words, like "repeatedly", would be good. They would help avoid regressions too.

lpil, Nov 18 '25 14:11

Sorry for the delay! I've added unit tests for -ly and -ist words. While there are many other cases where the Porter and Snowball algorithms produce different stems, I was selecting for two things:

  1. The Porter stemmer had to produce a stem longer than a relevant search term. Otherwise, even with a sub-optimal stem, search would still work fine, since the search term would likely stem down to the same thing.
  2. The words had to be ones that one could imagine being used in package descriptions.
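
As a rough illustration of the first criterion, here is a minimal Python sketch (not the Gleam code in this change). The two "stemmers" are toy lookup tables rather than real implementations: "repeatedly" -> "repeatedli" is the Porter result reported earlier in this thread, while the other entries and all function names are assumptions made for the example.

```python
# Toy stand-ins for the two stemmers: lookup tables, not real implementations.
# "repeatedly" -> "repeatedli" is the Porter result reported in this thread;
# the "repeat" entries are assumed for illustration.
PORTER_STEMS = {"repeatedly": "repeatedli", "repeat": "repeat"}
SNOWBALL_STEMS = {"repeatedly": "repeat", "repeat": "repeat"}


def make_stemmer(table):
    """Return a stemmer that looks words up in `table`, defaulting to the word itself."""
    def stem(word):
        return table.get(word, word)
    return stem


def search_hits(stem, description_word, search_term):
    """A search hits when both words reduce to the same stem."""
    return stem(description_word) == stem(search_term)


def is_good_test_case(old_stem, new_stem, description_word, search_term):
    """Criterion 1: the old stemmer misses the package, the new one finds it."""
    return (not search_hits(old_stem, description_word, search_term)
            and search_hits(new_stem, description_word, search_term))


porter = make_stemmer(PORTER_STEMS)
snowball = make_stemmer(SNOWBALL_STEMS)
print(is_good_test_case(porter, snowball, "repeatedly", "repeat"))
```

A word only qualifies when the old stem and the search term's stem disagree; if both stemmers reduce the word and the search term to the same thing, search already works and the word makes a weak regression test.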

seafoamteal, Nov 23 '25 21:11