Single word / part of sentence correction
I want to use Greynir-Correct to correct text that is not a whole sentence, in the extreme case a single word. Which method or options should I use to make that possible?
Currently, when using the tokenize() method with option only_ci=True, it complains about the following:
Maðurin Z002 Orð á að byrja á hástaf: 'maðurin'
Maðurinn Z002 Orð á að byrja á hástaf: 'maðurinn'
Sample code:
from reynir_correct import tokenize
texts = ["maðurin", "maðurinn"]
for text in texts:
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")
Interesting question, and this may well be a use case that we should support better. As is, the code is mostly oriented towards review of continuous text, typically whole sentences.
The code that checks the spelling of a single token is basically around this line. The call to spelling.Corrector.correct() can optionally be provided with a context, i.e. preceding tokens that will then be used to adjust the correction probabilities based on a trigram language model.
See also the short test function at the bottom of spelling.py.
At least the documentation of tokenize() doesn't state any assumptions about the text structure, in contrast to the documentation of check() and check_single(). And yes, this use case does exist, e.g. for spell checking web input forms, where users often enter only single words or short phrases.
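In the meantime, a possible stopgap for the form-input case is to filter out the sentence-start complaint on the caller's side. The following is only a sketch, not something the library provides: `fragment_errors` and `IGNORED_FOR_FRAGMENTS` are names made up here, the tokens are assumed to expose `txt`, `error_code` and `error_description` as in the sample code above, and the demo uses stand-in objects (including a hypothetical `X999` code) so that it runs without the library installed.

```python
from types import SimpleNamespace
from typing import Any, FrozenSet, Iterable, Iterator, Tuple

# Z002 is the "word should start with a capital letter" code from the output above.
# Which other codes make no sense for fragments is an open question.
IGNORED_FOR_FRAGMENTS: FrozenSet[str] = frozenset({"Z002"})


def fragment_errors(
    tokens: Iterable[Any],
    ignored: FrozenSet[str] = IGNORED_FOR_FRAGMENTS,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (text, code, description) for annotated tokens, skipping ignored codes."""
    for tok in tokens:
        if not getattr(tok, "txt", ""):
            continue  # tokens without text, as skipped in the sample code
        code = getattr(tok, "error_code", "") or ""
        if not code or code in ignored:
            continue  # no annotation, or one that only applies to full sentences
        yield tok.txt, code, getattr(tok, "error_description", "") or ""


# Stand-in tokens for illustration only; X999 is a hypothetical error code.
demo = [
    SimpleNamespace(txt="", error_code="", error_description=""),
    SimpleNamespace(
        txt="Maðurin",
        error_code="Z002",
        error_description="Orð á að byrja á hástaf: 'maðurin'",
    ),
    SimpleNamespace(txt="orð", error_code="X999", error_description="hypothetical error"),
]

for txt, code, description in fragment_errors(demo):
    print(f"{txt:12} {code:8} {description}")
```

With the real library this would presumably be called as fragment_errors(tokenize("maðurin", only_ci=True)). Note that it only hides the Z002 complaint: the token text is still capitalized, and it does not by itself make the spelling correction for 'maðurin' fire, so proper support for fragments would still have to come from the library.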