
Add a Hunspell matcher

Open jonathonherbert opened this issue 5 years ago • 0 comments

What does this change?

Add a Hunspell matcher for dictionary-based matches. The code itself should be fairly straightforward – we depend on HunspellJNA and JNA, which as far as I can see must be included as unmanaged dependencies.

There are a few things in addition to the Hunspell integration that are necessary –

  • we must tokenise the input to the dictionary, strip non-word tokens like punctuation, and recombine contractions, which Stanford NLP tokenises separately ("shouldn't" is tokenised as "should" and "n't")
  • there are some kinds of strings that we should probably not treat as words. I've added e-mail addresses and Twitter handles to this list (I'm sure there'll be more of these down the line!)
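
As a rough illustration of the two points above, here is a minimal Python sketch of the token pre-processing, not the actual matcher code. The regular expressions, the list of contraction suffixes, and the function names (`merge_contractions`, `words_to_check`) are all assumptions made for the example; the input is assumed to be a list of token strings as a Stanford-style tokeniser might emit them.

```python
import re

# Hypothetical patterns for this sketch; not taken from the Typerighter codebase.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
TWITTER_HANDLE = re.compile(r"@\w{1,15}")
# A "word" here is letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
# Assumed set of suffix tokens that the tokeniser splits off contractions.
CONTRACTION_SUFFIXES = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_contractions(tokens):
    """Glue split contraction suffixes back onto the preceding token."""
    merged = []
    for token in tokens:
        if merged and token.lower() in CONTRACTION_SUFFIXES:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


def words_to_check(tokens):
    """Return only the tokens that should be looked up in the dictionary."""
    result = []
    for token in merge_contractions(tokens):
        # Skip strings we do not want to treat as words.
        if EMAIL.fullmatch(token) or TWITTER_HANDLE.fullmatch(token):
            continue
        # Drop punctuation, numbers and anything else that is not word-like.
        if WORD.fullmatch(token):
            result.append(token)
    return result


tokens = ["You", "should", "n't", "email", "jane@example.com",
          "or", "@guardian", ",", "ok", "?"]
print(words_to_check(tokens))
```

A real implementation would also need to carry character offsets through the merge so that matches can be mapped back onto the original text; that is left out here for brevity.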

There's still plenty of work to do to make this fit for user consumption. We'll need –

  • an ingest process that updates dictionary files
  • a clear understanding of how Hunspell copes with multi-word phrases, which I don't have at present
  • an idea of 'priority' for matches, as dictionary definitions are likely to be the lowest priority match (e.g. they should not supersede name entries)
  • a structure to map dictionary matches to dictionary entries provided by a third party, to provide definition information for suggestions
  • to think hard about how we provide 'correct' definitions for certain kinds of entries which may be useful – for example, name lists, which we'll likely expect to highlight green if correct, along with their definition
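
To make the 'priority' idea a little more concrete, one possible approach is sketched below in Python. This is only one way it could work, not a proposed design: the `Match` shape, the numeric priority scale, and the rule that a lower-priority match is dropped whenever it overlaps a higher-priority one are all assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start: int     # inclusive character offset
    end: int       # exclusive character offset
    source: str    # e.g. "names" or "dictionary" (hypothetical labels)
    priority: int  # higher wins; the scale is an assumption


def overlaps(a, b):
    """True if the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def resolve_by_priority(matches):
    """Keep the highest-priority match wherever matches overlap."""
    kept = []
    # Visit high-priority matches first so they claim their ranges.
    for match in sorted(matches, key=lambda m: (-m.priority, m.start)):
        if not any(overlaps(match, other) for other in kept):
            kept.append(match)
    return sorted(kept, key=lambda m: m.start)


name = Match(0, 13, "names", priority=10)
dictionary_inside_name = Match(5, 13, "dictionary", priority=1)
dictionary_elsewhere = Match(20, 25, "dictionary", priority=1)

for m in resolve_by_priority([dictionary_inside_name, dictionary_elsewhere, name]):
    print(m.source, m.start, m.end)
```

With this rule a dictionary match that falls inside a name entry is suppressed, while dictionary matches elsewhere in the text are unaffected.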

How to test

  • The automated tests should pass, and we should be convinced they're adequately testing the matcher. They use an example dictionary with a few words defined in apps/checker/conf/resources/hunspell.
  • I've added a branch, jsh/hunspell-matcher-test-branch, that will include this test dictionary when the application runs. Running Typerighter locally or in CODE, you should find that Typerighter now issues corrections for these words. For example, in Composer, the Guardian's CMS, this looks like:
[Image: Screenshot 2022-06-10 at 14 53 14]

jonathonherbert, Feb 11 '20 18:02