Improve keyword recognition in categorisation: word plural forms
Is your feature request related to a problem? Please describe.
The current ODSCategories keyword map has a lot of contextually duplicate keywords to accommodate various forms of the same keyword. For example, "school" and "schools" are both listed to accommodate the plural form of the word. This "duplication" adds unnecessary bulk to the mapping: it lengthens the time needed to categorise and makes the mapping more cumbersome to curate.
The ultimate aim of this task is to:
- reduce the size of the category-keyword map,
- while also maintaining or improving the volume of matched datasets having those keywords.
Describe the solution you'd like
- [X] Required: we can regex match keywords in plural form (an "s" suffix)
- [X] Required: ODSCategories_Keywords must retain the keyword as it appears in the mapping (not the variant found in the dataset)
- [ ] Optional: consider other forms of plurals ("ies", "es")
- [ ] Optional: consider plural forms in phrases (word groups)
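As a rough illustration of the two required items, here is a minimal sketch of how an optional "s" suffix could be matched while still recording the keyword as written in the mapping. The `CATEGORY_KEYWORDS` sample, `compile_keyword` and `match_keywords` are hypothetical names made up for this example, not the project's actual code or data.

```python
import re

# Hypothetical sample of a category-keyword map; the real map differs.
CATEGORY_KEYWORDS = {
    "Education": ["school", "nursery"],
    "Transport": ["bus stop", "road"],
}


def compile_keyword(keyword):
    """Build a case-insensitive pattern for the keyword with an optional trailing "s"."""
    # re.escape guards against regex metacharacters inside a keyword;
    # \b anchors stop "school" from matching inside "preschool".
    return re.compile(r"\b" + re.escape(keyword) + r"s?\b", re.IGNORECASE)


def match_keywords(text, category_keywords=CATEGORY_KEYWORDS):
    """Return {category: [keywords]} using the keyword as written in the mapping."""
    matches = {}
    for category, keywords in category_keywords.items():
        # Record the mapping keyword, not the variant that appeared in the text.
        found = [kw for kw in keywords if compile_keyword(kw).search(text)]
        if found:
            matches[category] = found
    return matches


print(match_keywords("Primary Schools in Glasgow"))
print(match_keywords("Nurseries by council area"))
```

With this approach "Schools" is reported under the mapping keyword "school", while "Nurseries" is not matched by "nursery", which is the gap the optional "ies"/"es" item would cover.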
Describe alternatives you've considered
Stemming. We could stem each word down to its root, but stemming might introduce all manner of complexities we may not need to handle just yet. Simple "s" suffix matching would already let us remove a large chunk of our current keyword-category map, which makes it low-hanging fruit.
Additional context
See relevant docs: How-to-modify-category-keywords
Will take a look at this issue.
I've submitted a PR, please review: https://github.com/OpenDataScotland/the_od_bods/pull/240 @KarenJewell
The PR only handles the simple case of removing a trailing 's' (upper or lowercase). It doesn't deal with other plural forms ("ies", "es").
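To make the limitation concrete, here is a small sketch of what stripping a single trailing 's' does to a few words. This is not the code from the PR; `strip_plural_s` is a hypothetical helper written only to illustrate the behaviour described above.

```python
def strip_plural_s(word):
    """Remove one trailing 's' or 'S' if present; otherwise return the word unchanged."""
    # Slicing with [-1:] is safe on an empty string.
    return word[:-1] if word[-1:] in ("s", "S") else word


for word in ["schools", "SCHOOLS", "nurseries", "buses"]:
    print(word, "->", strip_plural_s(word))
```

"schools" and "SCHOOLS" reduce to their singular forms, but "nurseries" becomes "nurserie" and "buses" becomes "buse", so neither would line up with "nursery" or "bus" in the mapping.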
Merged #240