
Single word / part of sentence correction

Open lumpidu opened this issue 4 years ago • 2 comments

I want to use Greynir-Correct to correct text that is not a whole sentence, in the extreme case a single word. Which method or options should I use to make that possible?

Currently, when using the tokenize() method with option only_ci=True, it complains about the following:

Maðurin      Z002     Orð á að byrja á hástaf: 'maðurin'
Maðurinn     Z002     Orð á að byrja á hástaf: 'maðurinn'

Sample code:

from reynir_correct import tokenize

texts = ["maðurin", "maðurinn"]

for text in texts:
    for token in tokenize(text, only_ci=True):
        if token.txt:
            print(f"{token.txt:12} {token.error_code:8} {token.error_description}")

lumpidu, Jan 13 '21 15:01

Interesting question, and this may well be a use case that we should support better. As is, the code is mostly oriented towards review of continuous text, typically whole sentences.

The code that checks the spelling of a single token is basically around this line. The call to spelling.Corrector.correct() can optionally be provided with a context, i.e. preceding tokens that will then be used to adjust the correction probabilities based on a trigram language model.

See also the short test function at the bottom of spelling.py.
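
To illustrate the idea of context-adjusted correction, here is a minimal toy sketch. It is not GreynirCorrect's implementation and does not call its API: the trigram counts, the candidate scores and the name `rank_candidates` are all made up for illustration. It only shows how the two preceding tokens could re-rank candidate corrections, and why a single word with no context falls back to its context-free score.

```python
from collections import Counter

# Hypothetical trigram counts; a real model would be estimated from a large corpus.
TRIGRAM_COUNTS = Counter({
    ("ég", "sá", "manninn"): 8,
    ("ég", "sá", "maðurinn"): 1,
    ("þarna", "er", "maðurinn"): 9,
    ("þarna", "er", "manninn"): 1,
})


def rank_candidates(candidates, context=()):
    """Return candidate corrections ordered best first.

    candidates: dict mapping candidate word -> context-free score (assumed given)
    context:    preceding tokens; only the last two are used
    """
    prev = tuple(context[-2:])

    def score(word):
        base = candidates[word]
        if len(prev) < 2:
            # Not enough context (e.g. single-word input): use the base score only.
            return base
        # Add-one smoothing so that an unseen trigram does not zero out the score.
        return base * (TRIGRAM_COUNTS[prev + (word,)] + 1)

    return sorted(candidates, key=score, reverse=True)


candidates = {"maðurinn": 0.6, "manninn": 0.4}
print(rank_candidates(candidates))                    # no context
print(rank_candidates(candidates, ("ég", "sá")))      # accusative context
print(rank_candidates(candidates, ("þarna", "er")))   # nominative context
```

With no context the ranking depends only on the base scores, which is the situation a single-word input is in.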

vthorsteinsson, Jan 13 '21 18:01

At least the documentation of tokenize() doesn't state any assumptions about the text structure, in contrast to the documentation of check() and check_single(). And yes, this use case exists, e.g. for spell checking web input forms, where often only single words or short phrases are entered.

lumpidu, Jan 13 '21 21:01