Spellchecking should be able to return multiple suggestions

Open kelson42 opened this issue 2 months ago • 3 comments

This applies when several of the best results share the same Levenshtein (or similar) distance.

Since Xapian cannot do that out of the box, this issue depends on #1014.

kelson42, Oct 14 '25 09:10

@veloman-yunkan I believe this is the highest-priority feature to implement in the spellchecker now. You told me you have a way to do it, though it is a bit hacky. However, the following things are still unclear:

  • What is the nature of the problem?
  • What would the solution look like?

I think it would be better to be clear about these two points before implementing a PR.

kelson42, Nov 03 '25 20:11

@kelson42

The current implementation of spelling correction (which, by the way, is fully contained in libkiwix) relies on Xapian's spelling correction functionality. The latter only supports a single spelling correction per query. That limitation can be worked around by temporarily removing the returned correction from the spelling database and repeating the request, whereupon the next best correction will be returned. That procedure can be repeated as many times as needed to obtain the desired number of corrections. The removed entries must then be re-inserted into the spellings DB. Such a hack has the following drawbacks:

  1. Spelling correction is not supported by the in-memory backend of Xapian, i.e. the glass (on-disk) backend has to be used, and the spelling correction operation thus becomes a non-readonly operation with respect to the on-disk data, leading to:
     a. extra wear of SSD storage
     b. risk of data corruption (loss of spelling entries) if the application crashes during the spelling correction function call (this can be worked around with additional measures)
     c. spelling correction cannot be called concurrently in kiwix-serve
     d. slow
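
To make the remove-and-retry workaround described above concrete, here is a minimal, self-contained sketch. It does not use Xapian: `ToySpellingDb`, `editDistance` and `multipleSuggestions` are hypothetical names invented for this illustration, and the toy database only imitates the "one suggestion per request" constraint. With a real Xapian database the corresponding calls would presumably be `get_spelling_suggestion()`, `remove_spelling()` and `add_spelling()`, but that mapping is an assumption and is not taken from the libkiwix code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance, two-row dynamic-programming variant.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// HYPOTHETICAL stand-in for a spelling database that can return only ONE
// suggestion per request. This is not Xapian's API.
class ToySpellingDb {
public:
    void addSpelling(const std::string& word, unsigned freq = 1) { freq_[word] += freq; }

    // Removes the word entirely and returns the frequency it had (0 if absent).
    unsigned removeSpelling(const std::string& word) {
        const auto it = freq_.find(word);
        if (it == freq_.end()) return 0;
        const unsigned f = it->second;
        freq_.erase(it);
        return f;
    }

    // Best single suggestion within maxDistance, or "" if there is none.
    // Ties on distance are broken by the higher frequency.
    std::string suggest(const std::string& word, std::size_t maxDistance = 2) const {
        std::string best;
        std::size_t bestDist = 0;
        unsigned bestFreq = 0;
        for (const auto& entry : freq_) {
            const std::size_t d = editDistance(word, entry.first);
            if (d > maxDistance) continue;
            if (best.empty() || d < bestDist || (d == bestDist && entry.second > bestFreq)) {
                best = entry.first;
                bestDist = d;
                bestFreq = entry.second;
            }
        }
        return best;
    }

    std::size_t size() const { return freq_.size(); }

private:
    std::map<std::string, unsigned> freq_;
};

// The workaround: ask for a suggestion, remove it, ask again, and finally
// re-insert everything that was removed. Note that the database is MUTATED
// while this runs, which is the source of the drawbacks listed above: a crash
// before the re-insertion loop would lose the removed entries, and concurrent
// callers would see an incomplete database.
std::vector<std::string> multipleSuggestions(ToySpellingDb& db,
                                             const std::string& word,
                                             std::size_t maxResults) {
    const std::size_t sizeBefore = db.size();
    std::vector<std::string> results;
    std::vector<std::pair<std::string, unsigned>> removed;
    while (results.size() < maxResults) {
        const std::string s = db.suggest(word);
        if (s.empty()) break;  // no further candidates within range
        results.push_back(s);
        removed.emplace_back(s, db.removeSpelling(s));
    }
    for (const auto& r : removed) db.addSpelling(r.first, r.second);
    assert(db.size() == sizeBefore);  // every removed entry was restored
    return results;
}
```

For instance, with a database holding "hello" (frequency 5), "help" (3), "hell" (2) and "world" (9), asking for up to three suggestions for "helo" yields "hello", "help" and "hell" in that order, since all three are at distance 1 and this toy breaks ties by frequency. Each additional suggestion costs one removal and one re-insertion, which on an on-disk backend means extra writes.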

Besides, Xapian's algorithm for spelling correction is based on edit distance, rather than on phonetic similarity. If we intend to eventually provide real spellchecking instead of a surrogate one, we should use a real spellchecking engine. Switching to one will automatically enable multiple corrections.

veloman-yunkan, Nov 13 '25 13:11

@veloman-yunkan Thx for the explanation. My conclusion is that this issue is blocked by #1014 and we should implement #1014 first.

kelson42, Nov 14 '25 15:11