Offer search term spelling corrections
This is a common feature of many free-text search engines and can be helpful.
Xapian provides a core feature for that: https://docs.huihoo.com/xapian/docs/spelling.html
Original ticket on Sourceforge https://sourceforge.net/p/kiwix/bugs/849/
I downloaded the Kiwix application for Android and installed it. I
downloaded the Wiktionary in Spanish, unzipped it and uploaded it to the
external memory of the smartphone. I can read the file from the
application and all is well.
But when I write a word wrongly, Wiktionary does not correct
me. Example: in Spanish the word is written ZAPATO. If I write SAPATO, the
application tells me: Error: "failed load SAPATO article", but does
not correct me; it should show me the options. That means I have to be an
expert in the language to find anything, which does not help me, because the
objective is to correct me when I'm wrong.
If I do the same on the computer, it shows me these options:
1 - Zapato
2 - Calzado
3 - Pasta de zapatos
Is it possible to improve the application for Android?
Am I doing something wrong?
Here is a quick proof-of-concept in Python, showing that Xapian's built-in functionality would cover some common misspellings made by people learning German, either as their first or as a second language.
https://github.com/gremid/xapian-spelling-suggestions/
Two changes to libzim's index code would be necessary:
- During indexing the title of a ZIM entry has to be added to a spelling dictionary which is later used for lookups.
- During retrieval and in case that there are no results for a given (exact) query, the spelling dictionary would be queried for suggestions.
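To make the two steps concrete, here is a minimal sketch in plain Python. It does not use Xapian: the standard library's difflib stands in for the spelling dictionary lookup, and the class and method names (TitleIndex, add_title, search) are hypothetical, not libzim API.

```python
import difflib


class TitleIndex:
    """Toy stand-in for a title index plus a spelling dictionary."""

    def __init__(self):
        self.titles = {}             # lowercased title -> original title
        self.spelling_words = set()  # words usable as spelling suggestions

    def add_title(self, title):
        # Indexing step: record the title and feed its words to the
        # spelling dictionary.
        self.titles[title.lower()] = title
        self.spelling_words.update(title.lower().split())

    def search(self, query):
        # Retrieval step: exact lookup first; the spelling dictionary is
        # only consulted when the exact query has no result.
        hit = self.titles.get(query.lower())
        if hit is not None:
            return [hit], None
        # difflib is an assumption here, standing in for Xapian's lookup.
        close = difflib.get_close_matches(
            query.lower(), sorted(self.spelling_words), n=1, cutoff=0.6
        )
        return [], (close[0] if close else None)


index = TitleIndex()
for title in ["Zapato", "Calzado", "Pasta de zapatos"]:
    index.add_title(title)
results, suggestion = index.search("sapato")
```

With the Spanish example from the original ticket, the exact lookup for "sapato" finds nothing and the fallback proposes a word from the indexed titles.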
Do we want spelling suggestions for both fulltext and suggestion (title) searches?
Title suggestions would be sufficient from our (that is, the DWDS) perspective. As a dictionary, headword/title search is the main use case. Also the app is already sizable in comparison to the average, so saving some space by only indexing titles would also be in our immediate interest.
@mgautierfr Only suggestions. I don't think we should make this optional (because the additional index data are not that big), but we need a way to keep libzim backward compatible.
Xapian provides two methods to add and retrieve spelling suggestions:
- WritableDatabase::add_spelling: at db creation, adds a word to be considered as a spelling suggestion.
- Database::get_spelling_suggestion: at runtime, gives one (and only one) suggestion for a given word.
Proposition:
At the libzim level, there is really little to do:
- Add a metadata entry to the db to tell that spelling suggestion is available.
- When adding an item (suggestion title) to the database, we add the words of the item (minus stop words) to the db.
- Add a method SuggestionSearcher::has_spelling_suggestion to tell if spelling suggestion is available.
- Add a method SuggestionSearcher::get_spelling_suggestion to get the spelling suggestion for a word. This method would simply forward the call to get_spelling_suggestion. If spelling is not available (old db), get_spelling_suggestion will return an empty string.
While this technically adds the spelling suggestion feature to libzim, the majority of the work has to be done in dependent projects:
- Check for spelling suggestion for each word of the query
- Ask the user if they want to use the correction, or do automatic correction?
- Handle multi lang (which stop words to use ?)
- Rerun suggestion search with corrected query
- Improve UX
Note that spelling suggestion is totally independent of the language (no stop words or stemming are used by Xapian). It is up to the caller code (zim-tools, libkiwix, kiwix-tools) to properly remove stop words, ask for suggestions and use them when appropriate.
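A rough sketch of what the caller side could look like, in Python for brevity. The function correct_query and the fake_suggester used to exercise it are hypothetical; the only behaviour taken from the proposal is that an empty string means "no suggestion" (for instance with an old db).

```python
def correct_query(query, get_spelling_suggestion, stop_words=frozenset()):
    """Return the query with each non-stop word replaced by its suggestion.

    get_spelling_suggestion is any callable taking a word and returning a
    suggested spelling, or an empty string when there is none.
    """
    corrected = []
    for word in query.split():
        if word.lower() in stop_words:
            # Stop word handling is the caller's job, so leave it untouched.
            corrected.append(word)
            continue
        suggestion = get_spelling_suggestion(word)
        # An empty string means no suggestion: keep the original word.
        corrected.append(suggestion or word)
    return " ".join(corrected)


def fake_suggester(word):
    # Hypothetical stand-in for the libzim call, for illustration only.
    return {"sapatos": "zapatos"}.get(word.lower(), "")


corrected = correct_query("pasta de sapatos", fake_suggester, stop_words={"de"})
```

The caller can then decide whether to rerun the suggestion search with the corrected query directly or to ask the user first.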
Testing:
- Create a new testing zim file with spelling suggestion
- Test spelling suggestion is available or not depending on the zim file "version"
- Test spelling suggestion when applicable
@mgautierfr We now need this feature ASAP. Can you please estimate the effort so we can move ahead?
Xapian provides two methods to add and retrieve spelling suggestions:
- WritableDatabase::add_spelling: at db creation, adds a word to be considered as a spelling suggestion.
- Database::get_spelling_suggestion: at runtime, gives one (and only one) suggestion for a given word.
Also it is possible to obtain a spelling corrected query (yet, as with Database::get_spelling_suggestion() only one variant is returned):
- https://xapian.org/docs/spelling.html#queryparser-integration
- Xapian::QueryParser::get_corrected_query_string()
Overall the spelling correction functionality of Xapian seems quite limited. In particular, it looks like it won't work for autocomplete out of the box (if the spelling error is in the beginning of the word).
On the one hand it's better than nothing, however if it falls short of user expectations the impact on user experience may be negative compared to simply not offering that functionality at all.
I think that currently we should target the simplest version for a single word search.
We tested the functionality with 66 typical misspellings of German lexemes and were quite optimistic that the user experience would improve: https://github.com/gremid/xapian-spelling-suggestions .
Why do you think that increased recall in the case of an unsuccessful query would reduce usability?
@gremid In my comment I meant the attempt to offer usable spelling correction in the general case (for a multiword query). I don't think that we can achieve with Xapian the quality which the users should by now have gotten used to with popular web search engines that take context into account. Correcting a single word query is the best we may strive for at this point.
If we pursue spelling correction of only single-word queries (for title search only), I wonder if it makes sense to enhance the Xapian DB/index embedded in the ZIM file. We can instead create a temporary (in-memory) index for spelling corrections when opening the ZIM file (or on first use). Of course that will increase the latency of spelling correction functionality becoming available, however I think that we should do it at least as a proof of concept. I don't like the spelling correction functionality of Xapian to an extent of suspecting that we will have to move to something else eventually. Experimenting with it without any changes to the ZIM spec looks like a reasonable approach. @kelson42 What do you think?
Here are some edge cases that we may have to deal with even in the proposed simple case of single-word queries if we run spelling correction unconditionally on all ZIM files. Examining the terms recorded in the titledb of the wikipedia_en_all_maxi_2024-01.zim ZIM file reveals things like:
- various numbers
  - 0.1, 0.01, etc. up to 0.000000000000000000000000000001 and their siblings with a comma (,) used instead of the decimal point; same for 0.9 through 0.9999999999999999999999999999999
  - various fractional values (e.g. 0.500, 0.501, 0.522, 0.55, 0.56, 0.571046, 127,623)
  - integer values coming from numbering or various mostly nameless objects (e.g. 19208)
  - hexadecimal values (0x5f3759df, 0xbaadf00d, 0xc0dedbad, etc.)
- software version strings (0.9rc1, 1.1.1propellane, 23.0.1)
- IP addresses (127.0.0.1 through 127.0.0.8, 192.168.0.1, 192.168.1.11, 192.88.99.1)
- what looks like an IP but is rather an entry in an enzyme database (2.3.1.121)
- various codes of places
- non ASCII strings (qコちゃん, the地球侵略少女, 2814, ﬁbers (ﬁ is a single Unicode symbol), ꢚ, 𝜋, a lot of words/phrases in languages other than English)
The full list of terms is in the attached file.
After some research, I'd rather go with a more specialized spellchecker tool like https://github.com/hunspell/hunspell or https://github.com/nuspell/nuspell and generate input (dictionary) for the spellchecker from a filtered list of terms found in the titledb
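As an illustration of what "filtered" could mean here, the sketch below classifies terms with a few blunt heuristics and keeps only purely alphabetic ones. The categories, the regular expressions and the function names are assumptions made for this example; they are not taken from hunspell, nuspell or libzim, and terms are assumed to be already lowercased.

```python
import re

# Plain numbers, decimals with '.' or ',', and dotted groups such as
# 127.0.0.1 or 2.3.1.121.
_NUMERIC = re.compile(r"[0-9]+(?:[.,][0-9]+)*")
# Hexadecimal literals such as 0x5f3759df (lowercase terms assumed).
_HEX = re.compile(r"0x[0-9a-f]+")
_ASCII_DIGIT = re.compile(r"[0-9]")


def classify(term):
    """Return a coarse category for a term found in the title db."""
    if _HEX.fullmatch(term):
        return "hex"
    if _NUMERIC.fullmatch(term):
        return "numeric"
    if _ASCII_DIGIT.search(term):
        # Mixed strings such as 0.9rc1 or 1.1.1propellane.
        return "alphanumeric"
    if term.isalpha():
        return "word"
    return "other"


def dictionary_terms(terms):
    """Keep only the terms that look like ordinary words."""
    return [t for t in terms if classify(t) == "word"]


sample = ["zapato", "0.01", "127,623", "0x5f3759df", "0.9rc1", "127.0.0.1"]
kept = dictionary_terms(sample)
```

Note that str.isalpha() is Unicode-aware, so words in other scripts still pass this filter; whether they should is a separate decision.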
@veloman-yunkan Thank you for the analysis. I will get back to you soon to wrap up and discuss next steps.
Thinking aloud:
- From the use model perspective, suggestions can be enhanced with spelling correction as follows: two types of suggestions are shown to the user (in that order):
  - auto-completion or spelling corrections of the last word in the query (assuming that the query is edited only at the end); selecting this suggestion modifies the query
  - title suggestion for the query exactly as entered; selecting this suggestion takes the user directly to the respective article
- How should the implementation of spelling correction be split between libzim and libkiwix? One way is to perform spelling correction in libkiwix, with libzim only providing data for it (somehow exposing the terms database to be used as input for the spellchecker module). Alternatively, spelling correction functionality may be fully provided by libzim. For the use model proposed above, the latter approach can be implemented (or, rather, hacked) with only a slight semantic enhancement of the libzim suggestion API, fully limited to zim::SuggestionItem: for suggestions of auto-completion/spelling-correction kind, zim::SuggestionItem::getPath() and zim::SuggestionItem::getTitle() should return an empty string, while zim::SuggestionItem::getSnippet() should return the text of the modified query. This looks like a good shortcut for experimenting with spelling correction support in libzim. Readers will only have to be slightly enhanced to handle the new kind of suggestions.
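On the reader side, handling the new kind of suggestion could look roughly like this. The sketch is in Python with a small dataclass standing in for zim::SuggestionItem (its fields mirror getPath(), getTitle() and getSnippet()); the handle_selection function and its return values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SuggestionItem:
    # Stand-in for zim::SuggestionItem; not the real class.
    path: str
    title: str
    snippet: str


def handle_selection(item):
    """Decide what a reader does when the user selects a suggestion."""
    if item.path == "" and item.title == "":
        # Completion/spelling-correction kind: the snippet carries the
        # modified query, so the reader edits the search box.
        return ("edit_query", item.snippet)
    # Ordinary title suggestion: open the article.
    return ("open_article", item.path)


correction = SuggestionItem(path="", title="", snippet="pasta de zapatos")
article = SuggestionItem(path="A/Zapato", title="Zapato", snippet="")
actions = [handle_selection(correction), handle_selection(article)]
```

Readers that do not know about the convention would see an entry with an empty title, which is one of the limits of this shortcut.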
A proof-of-concept implementation is ready in #994 and can be tested via kiwix/libkiwix#1198. Please don't use large ZIM files while testing.
@veloman-yunkan Thank you for showing leadership on this issue. I really appreciate it. Can you please:
- Put a comment on the PR(s) so we better understand what is the approach you have chosen (and the limits). Is that for a single-word? Is that Xapian based? How does it behave with the testset of @gremid?
- Provide a small ZIM file (of DWDS ZIM) with the index, so we can focus on testing the user side
- Thank you very much for proposing https://github.com/kiwix/libkiwix/pull/1198, but can you please implement https://github.com/openzim/zim-tools/issues/469 first. The command line version is simpler/clearer for testing.
Here are a few answers to your past questions:
We can instead create a temporary (in-memory) index for spelling corrections when opening the ZIM file (or on first use).
I am really not in favour of this, because I see no advantage in performing this operation n times (each time the file is opened) if it can be done only once at write time. No problem for the POC.
I don't like the spelling correction functionality of Xapian to an extent of suspecting that we will have to move to something else eventually.
I understand your concern, but this is still IMHO the way forward. We need to think things through before releasing libzim with the feature... but at the same time, this is for sure something which will evolve, this is why we have a flexible search index system in the ZIM spec.
How implementation of spelling correction should be split between libzim and libkiwix
Everything in the libzim. This should be exploitable without libkiwix. See https://github.com/openzim/zim-tools/issues/469
@veloman-yunkan Thank you for showing leadership on this issue. I really appreciate it. Can you please:
Multi-word queries are supported but spelling suggestions are offered only for the last word akin to completion suggestions. Spelling and/or completion suggestions are offered instead of title suggestions if there are too many matches (more than can be displayed), so spelling/completion suggestions serve as a way to narrow down the search. The PoC is Xapian based though it uses a hack in order to offer more than one spelling suggestion.
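The "last word only" behaviour can be sketched as follows (Python, for illustration). The function last_word_suggestions and the suggest_words callable are hypothetical names; the PoC itself is implemented in libzim.

```python
def last_word_suggestions(query, suggest_words):
    """Build modified queries by correcting only the last word.

    suggest_words is any callable mapping a word to a list of corrected
    spellings (possibly empty).
    """
    head, _, last = query.rpartition(" ")
    if not last:
        # Empty query or trailing space: there is no last word to correct.
        return []
    prefix = head + " " if head else ""
    return [prefix + candidate for candidate in suggest_words(last)]


def fake_suggest_words(word):
    # Hypothetical stand-in returning more than one suggestion.
    return ["zapatos", "zapato"] if word == "sapatos" else []


queries = last_word_suggestions("pasta de sapatos", fake_suggest_words)
```

Earlier words are left exactly as typed, which matches the assumption that the query is edited only at the end.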
- Provide a small ZIM file (of DWDS ZIM) with the index, so we can focus on testing the user side
There is no need for special ZIM files, the PoC works by creating the additional (temporary) index on demand. This will let us figure out the best settings for creating that additional index, and then we can embed it in the ZIM file if we decide to do so (the other option being creating it only the first time when the ZIM file is opened and storing that data in a special cache directory).
- Thank you very much for proposing https://github.com/kiwix/libkiwix/pull/1198, but can you please implement https://github.com/openzim/zim-tools/issues/469 first. The command line version is simpler/clearer for testing.
The said zim-tools issue doesn't specify any details. Do you mean that the spell checking functionality should be added to zimsearch? In any case the current PoC implementation should be easier to play with in an interactive environment within a single run of the executable (with a command-line version the additional index will have to be built on every invocation).
Status: after discussing the current state and next steps with @veloman-yunkan, we concluded:
- Multi-word suggestion is not straightforward
- The Xapian spellchecker is not that great; we will almost certainly need another one eventually, but this is not straightforward either because it adds a new compilation dependency
- We need to release a first version with the following qualities:
  - Use Xapian
  - Works for one word only
  - Is handled in libzim separately from the suggestions. If the two are reunited, this happens at a higher level (libkiwix or the reader itself)
  - Provide a simple API which, as far as possible, won't change (API stability)
  - Create a spellcheck index on the fly (to avoid future ZIM format compatibility problems) and the creation of this index should not impact (much) the user (background creation at first launch)
  - Satisfy as much as possible the requirements of https://github.com/gremid/xapian-spelling-suggestions/blob/main/testdata.csv
  - Is simply testable via zimsearch, see https://github.com/openzim/zim-tools/issues/469
This should be implemented quickly, based on the research done in #994. Once merged, we will open multiple smaller issues to follow up on the improvements.
- Satisfy as much as possible the requirements of https://github.com/gremid/xapian-spelling-suggestions/blob/main/testdata.csv
I found one case that we won't be able to satisfy with (our version of) Xapian. The spelling correction "Lax -> Lachs" cannot be returned by Xapian versions after v1.4.18 because the max edit distance is capped at length(query_word) - 1 which reduces the value of the max edit distance argument from 3 to 2, making the spelling correction impossible for that input.
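The arithmetic can be checked with a few lines of Python. This sketch uses plain Levenshtein distance as a stand-in for Xapian's own edit distance measure (an assumption; for this particular pair the result is the same), and the cap formula is the one described above. The function names are made up for the example.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def effective_max_distance(word, requested):
    """Cap described above: at most length(query_word) - 1."""
    return min(requested, len(word) - 1)


needed = edit_distance("lax", "lachs")        # x -> c, then insert h and s
allowed = effective_max_distance("lax", 3)    # requested 3, capped by length
reachable = needed <= allowed
```

For "lax" the needed distance is 3 while the capped maximum is 2, so "lachs" is out of reach regardless of the requested value.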
Thanks for spotting that limitation; we can live with that.
- Create a spellcheck index on the fly (to avoid future ZIM format compatibility problems) and the creation of this index should not impact (much) the user (background creation at first launch)
@kelson42
My initial intent for the prototype stage was to implement a disposable in-memory index for spellings, but then it turned out that the in-memory backend of Xapian doesn't support spellings. So the workaround in #994 and #1007 was to create the on-disk Xapian database in an in-memory file system, /dev/shm, which may fail to work on Android and/or iOS. Given that the Xapian database used for spelling corrections is an on-disk one, it makes more sense to create it only once per ZIM file and keep it cached in the persistent file system. So we need to introduce to libzim a concept of a cache directory via a new internal method getCacheDirectory(). Is it OK to do so, or does it go against certain design assumptions/requirements behind libzim?
@veloman-yunkan What exactly would this method do? To me, it is the role of the libzim user to tell where to save this index. I would prefer that the feature activation be triggered by the creation of the index on disk, done by a dedicated function call "writeSpellCheckerIndex(string path, bool rewrite)".
@kelson42 The intended API for spelling correction doesn't expose anything about spelling databases. There is only a new method getSpellingSuggestions(word, maxCount) which will initially create the index on demand and will save it in the cache directory. After the implementation stabilizes we may consider moving the spelling database into the ZIM file.
OK, so where are you writing this index?
As explained above, currently I write it in a temporary directory under /dev/shm. I want to be able to write it in a persistent directory serving as a cache directory for libzim (e.g. ~/.cache/libzim under Linux). getCacheDirectory() will determine the path of that directory depending on the host platform. We can also enable the user of libzim to control it via a public API method setCacheDirectory().
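For illustration only, a platform-dependent lookup of that kind might look like the Python sketch below. Only the Linux default (~/.cache/libzim) comes from the comment above; the XDG_CACHE_HOME override and the macOS and Windows locations are assumptions based on common platform conventions, and cache_directory is a hypothetical name, not the proposed getCacheDirectory() itself.

```python
from pathlib import PurePosixPath, PureWindowsPath


def cache_directory(platform, env, home):
    """Pick a per-user cache directory for libzim on the given platform.

    platform follows sys.platform naming ("linux", "darwin", "win32");
    env is a mapping of environment variables; home is the user's home.
    """
    if platform == "win32":
        # Assumed convention: %LOCALAPPDATA%, else <home>\AppData\Local.
        local = env.get("LOCALAPPDATA")
        base = PureWindowsPath(local) if local else PureWindowsPath(home) / "AppData" / "Local"
        return str(base / "libzim")
    if platform == "darwin":
        # Assumed convention: ~/Library/Caches.
        return str(PurePosixPath(home) / "Library" / "Caches" / "libzim")
    # Linux and other Unix-likes: $XDG_CACHE_HOME if set, else ~/.cache.
    base = env.get("XDG_CACHE_HOME") or str(PurePosixPath(home) / ".cache")
    return str(PurePosixPath(base) / "libzim")


linux_default = cache_directory("linux", {}, "/home/user")
```

Passing the platform and environment in as arguments keeps the function easy to test; a real implementation would read them from the host.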
We have to give the control to the user. I see no way the libzim could know in which directory the user/dev wants to save this index.
Then we should avoid calling it a "cache", because there are already so many caches that we get lost among them. We will create a dedicated method with a very specific name. This method will disappear in libzim 10.
Then, regarding what triggers the creation of the index: I see you want to do it at the first spellchecking run. This approach will create a delay on the first call. This is why I recommended doing it earlier, at the time the libzim client explicitly calls to set the directory.
I think there is some misunderstanding.
With the proposed enhancement I want to introduce to libzim a more general idea of persistent cached data. That data will be stored in the mentioned cache directory. Spell checking will be the first client of that functionality (and may later stop taking advantage of it when the spellchecking index is embedded in the ZIM file). But there may be other features that will benefit from such persistent caches.
@veloman-yunkan I believe I understand what you aim to do, but:
- The ZIM file principle is that it comes with everything needed and is directly and fully exploitable. Therefore this kind of cache is not needed
- This approach is only temporary; the goal is then to move forward and embed the index in the ZIM
- The API will be removed as quickly as possible and will hopefully survive only a few minor releases until the 10.0.0 release
- We should focus on delivering the feature, and therefore not make the problem more complex than it already is
The feature has been implemented temporarily in libkiwix, see https://github.com/openzim/libzim/issues/731. Issues have been created to:
- Move the feature in the libzim https://github.com/openzim/libzim/issues/1011
- Implement multiple spellchecking suggestions https://github.com/openzim/libzim/issues/1012
- Handle multiple word pattern https://github.com/openzim/libzim/issues/1013
- Consider using another spellchecking library https://github.com/openzim/libzim/issues/1014
@veloman-yunkan Could/Should we close this issue in favour of the new issues I have created? Did we miss something?