Improve keyword recognition in categorisation: word plural forms

Open KarenJewell opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.

The current ODSCategories keyword map has a lot of contextually duplicate keywords to accommodate various forms of the same keyword. For example, both "school" and "schools" are listed to accommodate the plural form of the word. This "duplication" adds unnecessary bulk to the mapping: it lengthens the time needed to categorise and makes the mapping more cumbersome to curate.

The ultimate aim of this task is to:

  • reduce the size of the category-keyword map,
  • while also maintaining or improving the volume of matched datasets having those keywords.

Describe the solution you'd like

  • [X] Required: we can regex match keywords in plural form (an "s" suffix)
  • [X] Required: ODSCategories_Keywords must retain the keyword as it appears in the mapping (not the variant found in the dataset)
  • [ ] Optional: consider other forms of plurals ("ies", "es")
  • [ ] Optional: consider plural forms in phrases (word groups)
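
The two required items above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: `CATEGORY_KEYWORDS`, `build_pattern` and `match_keywords` are hypothetical names, and the sample mapping is made up. Each keyword is matched with an optional "s" suffix, and the keyword recorded is the one from the mapping rather than the form found in the dataset text.

```python
import re

# Hypothetical sample of a category-keyword map (not the real ODSCategories mapping).
CATEGORY_KEYWORDS = {
    "Education": ["school", "library"],
    "Transport": ["bus", "road"],
}


def build_pattern(keyword):
    """Compile a case-insensitive pattern for the keyword with an optional 's' suffix."""
    # \b keeps the match on whole words, so "school" does not match "schoolsite".
    return re.compile(r"\b" + re.escape(keyword) + r"s?\b", re.IGNORECASE)


def match_keywords(text, category_keywords=CATEGORY_KEYWORDS):
    """Return {category: [mapping keywords found in text]} for categories with a hit."""
    matches = {}
    for category, keywords in category_keywords.items():
        # Record the keyword as written in the mapping, not the variant in the text.
        found = [kw for kw in keywords if build_pattern(kw).search(text)]
        if found:
            matches[category] = found
    return matches


print(match_keywords("Primary Schools and roads in Fife"))
```

With this approach "Schools" is recorded as "school", so the mapping only needs the singular entry. Forms such as "libraries" or "buses" are not matched, which is what the optional items would address.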

Describe alternatives you've considered

Stemming. We could stem the word down, but stemming might introduce all manner of complexities we may not need to handle just yet. Simple "s" suffix matches will be able to remove a whole chunk of our current keyword-category map - low-hanging fruit.

Additional context

See relevant docs: How-to-modify-category-keywords

KarenJewell avatar Jan 07 '23 13:01 KarenJewell

Will take a look at this issue.

fozy81 avatar May 26 '23 19:05 fozy81

I've submitted a PR, please review: https://github.com/OpenDataScotland/the_od_bods/pull/240 @KarenJewell

The PR only handles the simple case of removing a trailing "s" (upper or lowercase). It doesn't deal with other plurals ("ies", "es").
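
For illustration only, trailing-"s" removal might look something like the sketch below. It is not the code from the PR: `strip_trailing_s` and `keywords_in_title` are hypothetical names, and tokenising on runs of letters is an assumption.

```python
import re


def strip_trailing_s(word):
    """Drop a single trailing 's' or 'S' if present; otherwise return the word unchanged."""
    return word[:-1] if word[-1:] in ("s", "S") else word


def keywords_in_title(title, keywords):
    """Return the mapping keywords whose normalised form appears among the title's words."""
    # Assumption: words are runs of ASCII letters; punctuation and digits are ignored.
    tokens = {strip_trailing_s(t).lower() for t in re.findall(r"[A-Za-z]+", title)}
    # Normalise the keyword the same way so both sides are compared consistently.
    return [kw for kw in keywords if strip_trailing_s(kw).lower() in tokens]


print(keywords_in_title("Primary SCHOOLS in Fife", ["school", "road"]))
print(strip_trailing_s("Parties"))
```

Note that "Parties" becomes "Partie" rather than "Party", which is the "ies" case this simple approach leaves unhandled.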

fozy81 avatar Jun 17 '23 18:06 fozy81

Merged #240

JackGilmore avatar Jul 28 '23 19:07 JackGilmore