
Improve documents highlighting

Open Kerollmops opened this issue 5 years ago • 2 comments

The current engine doesn't highlight the query words found in the documents. This is due to the fact that it doesn't even know where, in terms of bytes, the query words were extracted from the documents.

In the previous engine, we kept the positions of the words as byte offsets and lengths, and this turned out to be a big source of problems when highlighting documents (many out-of-bounds errors, etc.). It was also a big part of what was stored on disk, maybe 30% of the database size.

So I thought: why not get rid of this raw information? We could use the query words the user typed to reconstruct a highlighted document from the stored document.

We can highlight a document by extracting its content, using the tokenizer to find the words it contains, and highlighting those that match. It could be a little harder to highlight multi-word synonyms, but maybe we can find a solution later.
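
To make the idea concrete, here is a minimal sketch of highlighting by re-tokenizing the stored text at query time. It is not milli's actual code: the function name `highlight_words` is hypothetical, and the "tokenizer" is just an assumption that words are runs of alphanumeric characters, compared case-insensitively. No byte offsets need to be stored in the index.

```rust
use std::collections::HashSet;

/// Hypothetical helper: wraps every word of `text` that equals a query word in <em> tags.
/// Assumes a naive tokenizer where a word is a run of alphanumeric characters.
fn highlight_words(text: &str, query: &str) -> String {
    // Lowercased set of the words the user typed.
    let query_words: HashSet<String> = query
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect();

    let mut out = String::with_capacity(text.len());
    let mut word_start: Option<usize> = None;

    // A sentinel separator at the end flushes the last word; it is never written out.
    let sentinel = std::iter::once((text.len(), ' '));
    for (i, c) in text.char_indices().chain(sentinel) {
        if c.is_alphanumeric() {
            if word_start.is_none() {
                word_start = Some(i);
            }
        } else {
            if let Some(start) = word_start.take() {
                let word = &text[start..i];
                if query_words.contains(&word.to_lowercase()) {
                    out.push_str("<em>");
                    out.push_str(word);
                    out.push_str("</em>");
                } else {
                    out.push_str(word);
                }
            }
            if i < text.len() {
                out.push(c);
            }
        }
    }
    out
}
```

With this sketch, `highlight_words("Hello big world", "hello world")` yields `<em>Hello</em> big <em>world</em>`. Multi-word synonyms are not handled, since a single query word can then correspond to several document words.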

Kerollmops, Jul 06 '20 11:07

The current version of the search engine now highlights whole words and derived words in the documents.

However, we must still improve the highlighter to highlight only the matching part of a word when a derived word is found (for example, only the first three letters of "hello" if the query is "hel", not the whole word). This is related to https://github.com/tantivy-search/levenshtein-automata/issues/5.

Kerollmops, Oct 05 '20 13:10

An update:

The search engine now highlights only the part of the word that matches the query. For example, with a document containing cheval and the query che, the result is highlighted as <em>che</em>val.
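
As an illustration of this behavior, here is a minimal sketch that highlights only the matched prefix of a word. The function name `highlight_prefix` is hypothetical and this is not the engine's implementation: it assumes an exact, case-insensitive prefix match and does not cover typo-tolerant matches.

```rust
/// Hypothetical helper: if `query` is a (case-insensitive) prefix of `word`,
/// wraps only that prefix in <em> tags; otherwise returns `word` unchanged.
fn highlight_prefix(word: &str, query: &str) -> String {
    let mut word_chars = word.char_indices();
    // Byte length of the part of `word` covered by the query.
    let mut matched_len = 0;

    for qc in query.chars() {
        match word_chars.next() {
            Some((i, wc)) if wc.to_lowercase().eq(qc.to_lowercase()) => {
                matched_len = i + wc.len_utf8();
            }
            // The query is not a prefix of the word: leave the word untouched.
            _ => return word.to_string(),
        }
    }

    if matched_len == 0 {
        return word.to_string();
    }
    // `matched_len` always falls on a char boundary, so slicing is safe.
    format!("<em>{}</em>{}", &word[..matched_len], &word[matched_len..])
}
```

For instance, `highlight_prefix("cheval", "che")` gives `<em>che</em>val`, while `highlight_prefix("cheval", "chat")` returns `cheval` unchanged.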

However, we still have the following issue: the highlighted words are not necessarily the words that made the document match the query. Example: a dataset contains a document with "description": "Blabla news... blabla ... good news". If the user types "good news", the document matches the query thanks to good news, not the news at the beginning of the field. Despite that, the search engine highlights both news and good news. This can be problematic when cropping: the search engine crops around the first matched word, here news rather than good news, which are the "real" words to highlight.
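
The difference between the two candidate crop positions can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the document is assumed to be already tokenized and lowercased, and "the words that made the document match" is simplified to the query words appearing consecutively and in order.

```rust
/// Index of the first token equal to any query word
/// (the crop position described as problematic above).
fn first_any_match(tokens: &[&str], query: &[&str]) -> Option<usize> {
    tokens.iter().position(|t| query.iter().any(|q| q == t))
}

/// Index of the first place where all query words appear consecutively, in order
/// (a simplified stand-in for the "real" match).
fn first_phrase_match(tokens: &[&str], query: &[&str]) -> Option<usize> {
    if query.is_empty() {
        // `windows(0)` would panic, so handle the empty query explicitly.
        return None;
    }
    tokens.windows(query.len()).position(|w| w == query)
}
```

With the tokens `["blabla", "news", "blabla", "good", "news"]` and the query `["good", "news"]`, `first_any_match` returns `Some(1)` (the isolated news) while `first_phrase_match` returns `Some(3)` (the start of good news), which is the position a crop would ideally be centered on.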

curquiza, Aug 19 '21 13:08

This issue is no longer relevant; it is now more a matter of day-to-day work done in Charabia than in milli.

Kerollmops, Sep 27 '22 15:09