milli
Improve documents highlighting
The current engine doesn't highlight the query words found in the documents. This is due to the fact that it doesn't even know where, in terms of bytes, the query words were extracted from the documents.
In the previous engine, we kept the positions of the words as byte offsets and lengths, and this turned out to be a big source of problems when highlighting documents (many out-of-bounds errors, etc.). It was also a big part of what was stored on disk, maybe 30% of the database size.
So I thought: why not get rid of this raw information, and instead use the query words the user typed to reconstruct a highlighted document from the stored document?
We can highlight a document by extracting its content, using the tokenizer to find the words it contains, and highlighting those that match. It could be a little harder to highlight multi-word synonyms, but maybe we could find a solution later.
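The idea above can be sketched as follows. This is only an illustration, not milli's implementation: `segments` is a hypothetical stand-in for the real tokenizer (it simply splits on non-alphanumeric characters), and `highlight_words` is a made-up helper name. Matching is exact and case-insensitive, so typos, prefixes, and synonyms are not handled here.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the tokenizer: splits `text` into runs of
/// alphanumeric characters (words) and runs of everything else (separators).
/// The boolean tells whether the segment is a word.
fn segments(text: &str) -> Vec<(bool, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let is_word = c.is_alphanumeric();
        if let Some(prev) = current {
            if prev != is_word {
                out.push((prev, &text[start..i]));
                start = i;
            }
        }
        current = Some(is_word);
    }
    if let Some(prev) = current {
        out.push((prev, &text[start..]));
    }
    out
}

/// Rebuilds `text`, wrapping every word that equals one of `query_words`
/// (case-insensitively) in <em> tags. No byte offsets need to be stored:
/// everything is recomputed from the stored document and the query.
fn highlight_words(text: &str, query_words: &[&str]) -> String {
    let wanted: HashSet<String> = query_words.iter().map(|w| w.to_lowercase()).collect();
    let mut out = String::with_capacity(text.len());
    for (is_word, segment) in segments(text) {
        if is_word && wanted.contains(&segment.to_lowercase()) {
            out.push_str("<em>");
            out.push_str(segment);
            out.push_str("</em>");
        } else {
            out.push_str(segment);
        }
    }
    out
}
```

For instance, `highlight_words("Hello world", &["hello"])` produces `<em>Hello</em> world`. Because the highlighted string is rebuilt from segments of the stored text, there is no stored offset that could go out of bounds.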
The current version of the search engine now highlights whole words and derived words in the documents.
However, we must still improve the highlighter so that it only highlights the matching part of a word when a derived word is found (only highlight the first three letters of "hello" if your query is "hel", not the whole word). This is related to https://github.com/tantivy-search/levenshtein-automata/issues/5.
Some updates
The search engine now highlights only the part of a word that matches the query.
Ex: for a document containing cheval and the query che, the result is highlighted as <em>che</em>val.
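A minimal sketch of this partial highlighting, assuming the query is an exact (case-insensitive) prefix of the word. The function names `matched_prefix_len` and `highlight_prefix` are hypothetical and not part of milli; typo-tolerant matches, which is what the levenshtein-automata issue is about, are out of scope here.

```rust
/// Returns the byte length of the part of `word` matched by `query`,
/// if `query` is a case-insensitive prefix of `word`.
fn matched_prefix_len(word: &str, query: &str) -> Option<usize> {
    let mut word_chars = word.char_indices();
    let mut end = 0;
    for q in query.chars() {
        // The word is shorter than the query: no match.
        let (i, w) = word_chars.next()?;
        if !w.to_lowercase().eq(q.to_lowercase()) {
            return None;
        }
        // Track the end of the match on a char boundary of `word`.
        end = i + w.len_utf8();
    }
    if end == 0 { None } else { Some(end) }
}

/// Wraps only the matched prefix of `word` in <em> tags.
fn highlight_prefix(word: &str, query: &str) -> String {
    match matched_prefix_len(word, query) {
        Some(end) => format!("<em>{}</em>{}", &word[..end], &word[end..]),
        None => word.to_string(),
    }
}
```

With this sketch, `highlight_prefix("cheval", "che")` gives `<em>che</em>val`. The match length is measured on the document word rather than on the query, so slicing always lands on a valid character boundary.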
However, we still have the following issue: the highlighted words are not necessarily the words that made the document match the query.
Example: a dataset containing a document with "description": "Blabla news... blabla ... good news". If the user types "good news", the document matches the query thanks to good news, not the news at the beginning of the field. Despite that, the search engine will highlight both news and good news. This can be problematic during the crop: the search engine crops around the first matched word, here news and not good news, which are the "real" words to highlight.
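The difference between the two anchors can be illustrated with a small sketch. It works on an already tokenized, lowercased list of words, and both function names are hypothetical; this only illustrates the problem described above and is not how milli or Charabia select the crop window.

```rust
/// Index of the first word that matches any query word.
/// This mirrors the behavior described above: the crop anchors here.
fn first_match_index(words: &[&str], query: &[&str]) -> Option<usize> {
    words.iter().position(|w| query.contains(w))
}

/// Index of the first place where all query words appear consecutively
/// and in order. These are the "real" words to highlight in the example.
fn phrase_match_index(words: &[&str], query: &[&str]) -> Option<usize> {
    if query.is_empty() {
        // `windows(0)` would panic, and an empty query matches nothing.
        return None;
    }
    words.windows(query.len()).position(|window| window == query)
}
```

For the words `["blabla", "news", "blabla", "good", "news"]` and the query `["good", "news"]`, `first_match_index` returns `Some(1)` (the isolated news) while `phrase_match_index` returns `Some(3)` (the start of good news), which is where the crop would ideally be centered.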
This issue is no longer relevant: it is more a matter of everyday work done in Charabia than in milli.