full-text search (#902)
Design choices, notes, considerations:
1 - Choice of Postgres full-text search (instead of the more basic Whoosh and the more advanced ElasticSearch)
2 - Creation of specific Postgres configurations with the unaccent filter dictionary added (the only difference is that accents are ignored).
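For reference, a minimal sketch of what such a configuration could look like as a data migration. The config name, mapping, and migration layout here are illustrative only (not necessarily what the PR does), and it assumes the unaccent extension is already installed:

```python
from django.db import migrations

# Illustrative sketch: create an "english_unaccent" text search configuration that
# copies the built-in "english" config and adds the unaccent filter dictionary, so
# accented and unaccented words produce the same lexemes.
class Migration(migrations.Migration):

    dependencies = []  # to be filled with the app's previous migration

    operations = [
        migrations.RunSQL(
            sql="""
                CREATE TEXT SEARCH CONFIGURATION english_unaccent (COPY = english);
                ALTER TEXT SEARCH CONFIGURATION english_unaccent
                    ALTER MAPPING FOR hword, hword_part, word
                    WITH unaccent, english_stem;
            """,
            reverse_sql="DROP TEXT SEARCH CONFIGURATION english_unaccent;",
        ),
    ]
```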
3 - Relevance score that does not depend on the length of the text (a bit unfair for entities with short descriptions, but it probably gives better results in my opinion, since it then always gives a result between 0 and 1, cf. the "normalization option" in https://www.postgresql.org/docs/current/textsearch-controls.html).
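Assuming a hypothetical Entity model with a search_vector SearchVectorField (a fuller sketch appears under point 8 below), a length-independent rank could look roughly like this; the exact normalization flag and config used in the PR may differ:

```python
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F, Value

search_query = SearchQuery("stop making mistakes", config="english")
# normalization=32 divides the rank by (rank + 1), so the value does not depend on
# the document length and always falls in [0, 1); see the Postgres "normalization
# option" documentation linked above.
queryset = Entity.objects.annotate(
    relevance=SearchRank(F("search_vector"), search_query, normalization=Value(32))
).order_by("-relevance")
```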
4 - No autocorrection using trigrams (looks more suited to searches without indexes - some extensions exist, but I'm not sure the gain is worth the additional complexity). With Django trigrams, it looks like the similarity is computed against the whole field, not against the words in the field. Thus, if I type "misteke" and there is a video named "how to stop making mistakes", it won't match.
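To illustrate the limitation, a purely hypothetical snippet (it assumes a plain name text column and the pg_trgm extension, neither of which is part of this PR):

```python
from django.contrib.postgres.search import TrigramSimilarity

# The similarity is computed against the whole field value, not word by word:
# "misteke" vs "how to stop making mistakes" yields a low score, so the typo
# would not survive a simple similarity threshold.
matches = (
    Entity.objects
    .annotate(similarity=TrigramSimilarity("name", "misteke"))
    .filter(similarity__gt=0.3)
    .order_by("-similarity")
)
```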
5 - No synonyms search. Postgres supports it, but the difficulty is to find synonym dictionaries. In the literature, it usually seems to be used for purpose-specific needs, not with every synonym of every word.
6 - The search is fast (the queryset is usually evaluated in around 30 ms), but building every search vector can be long (15 seconds on my fast computer). This is because every entity is separately updated with the new search_vector value. I tried to use the queryset update function, which would be much faster, but eventually gave up because the config parameter of SearchVector does not seem to accept F("metadata__language"); it looks like it needs an explicit value.
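A possible middle ground, not implemented here and only a sketch, would be one bulk update per language with an explicit config value. The metadata keys and the language-to-config mapping below are assumptions:

```python
from django.contrib.postgres.search import SearchVector

# Hypothetical sketch: group entities by language and run one queryset update per
# group, so the SearchVector config can stay an explicit value instead of
# F("metadata__language").
LANGUAGE_CONFIGS = {"en": "english", "fr": "french"}

for lang, config in LANGUAGE_CONFIGS.items():
    Entity.objects.filter(metadata__language=lang).update(
        search_vector=SearchVector(
            "metadata__name", "metadata__description", config=config
        )
    )
```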
7 - Not including the subtitles in the full-text search, mainly because building the search vector already takes long.
8 - The entity "save" method is overridden to automatically update the SearchVectorField. This costs some extra time, but it would have been complicated to do with Postgres triggers (because of variable configs, weights, metadata formats...).
9 - Score function that mixes the total_score with the relevance score. I chose a function that promotes the search relevance more than the total_score. It would be simpler to have a function like relevance * (weights_sum + total_score), but perhaps it would downgrade relevant but badly-rated entities too much.
10 - A GIN index was added to the SearchVectorField. The section "Index size/usage statistics" of https://wiki.postgresql.org/wiki/Index_Maintenance shows how to display the index sizes (7 Mbytes in dev-env) and usage statistics. It sometimes happened that Postgres randomly chose not to use it, making the queryset processing 3 times longer. Here are example queryset logs (from a point when Postgres was not using the GIN index for queries of only 1 word), but don't waste too much time trying to make sense out of them: used_gin.txt unused_gin.txt
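For reference, declaring the index on the model could look like the snippet below, continuing the hypothetical Entity sketch from point 8 (only the indexed field is shown):

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Entity(models.Model):
    search_vector = SearchVectorField(null=True, editable=False)

    class Meta:
        # GIN index so full-text lookups on search_vector can avoid a sequential scan.
        indexes = [GinIndex(fields=["search_vector"])]
```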
11 - Filter via a normal search ("qs.filter(search_vector=search_query)") and then compute a SearchRank, instead of first computing a SearchRank to annotate the relevance and then filtering on some relevance threshold. It gives faster results. A consequence is that every word in the query must match the text vector; previously, an entity matching only 2 of 5 query words could still be returned.
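In queryset terms, the described order of operations is roughly the following (the search term and config name are illustrative, and the hypothetical Entity model from point 8 is assumed):

```python
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

search_query = SearchQuery("stop making mistakes", config="english")
queryset = (
    Entity.objects
    # Filter on the indexed vector first: this can use the GIN index and requires
    # every word of the query to match.
    .filter(search_vector=search_query)
    # Then rank only the rows that matched.
    .annotate(relevance=SearchRank(F("search_vector"), search_query))
    .order_by("-relevance")
)
```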
12 - Language-specific stemming and stop-words removal. The alternative would be to have the same configuration for every language (=> lower-quality search, but simpler to manage, and the possibility of much faster index building at startup).
13 (to deliberate) - If no language is given in the query (or none of the given languages are supported; Postgres supports fewer languages than our language filters), use a classical search without the search vector. This can be problematic if users usually have no recommendations language filter. Possible solutions include searching across every possible language even if it is slower (~120 ms instead of 30 ms), picking a few languages by default, or creating a 2nd SearchVectorField with a language-independent config... This may be worth a short discussion on the voice chat.
14 (to deliberate) - Shouldn't the frontend somehow automatically remove the date filter "A month ago" to return more results? This is especially important because people (including the past me) may not realize that they get very few results because of this filter.
I'll be available for a Discord voice chat most days this week if you have questions.
Hello @glerzing and thank you for this amazing work :100:
I'm going to test it today.
1 - Choice of Postgres full-text search (instead of the more basic Whoosh and the more advanced ElasticSearch)
OK
2 - Creation of specific Postgres configurations with the unaccent filter dictionary added (the only difference is that accents are ignored).
OK
3 - Relevance score that does not depend on the length of the text (a bit unfair for entities with short descriptions, but it probably gives better results in my opinion, c.f. "normalization option" in https://www.postgresql.org/docs/current/textsearch-controls.html).
I understand. I haven't run any tests yet, but intuitively, as we are working with not-that-big descriptions, it seems reasonable to use a 0 normalization. Also, videos with YouTube descriptions containing a lot of links to sources / social networks / etc. might be punished by normalization.
4 - No autocorrection using trigrams (looks more suited to searches without indexes - some extensions exist but not sure the gain is worth the additional complexity)
OK. The current implementation is already a huge improvement, we might think about auto-correction later.
5 - No synonyms search. Postgres supports it, but the difficulty is to find synonym dictionaries. In the literature, it usually seems to be used for purpose-specific needs, not with every synonym of every word.
OK.
Do you think it would be a good idea, at a later stage, to add a few specific dictionaries? Or would this bias the search too much?
I'm thinking about dictionaries related to climate / environment for instance.
6 - The search is fast (the queryset is usually evaluated in around 30 ms), but building every search vector can be long (15 seconds on my fast computer). This is because every entity is separately updated with the new search_vector value. I tried to use the queryset update function, which would be much faster, but eventually gave up because the config parameter of SearchVector does not seem to accept F("metadata__language"), it looks like it needs an explicit value.
Alright. When you say building every search vector, are you talking about building them for all entities in the database?
If yes, this is an operation that will be very rare, as the search_vector of an entity is updated on a regular basis, for instance at the same time as a metadata refresh (which is currently the case thanks to the save method).
7 - Not including the subtitles in the full-text search, mainly because building the search vector already takes long.
Is it the search vector of a single entity that is long to build, or all of them?
8 - The entity "save" method is overridden to automatically update the SearchVectorField. This costs some extra time, but it would have been complicated to do with Postgres triggers (because of variable configs, weights, metadata formats...).
OK. Let's see the performance.
13 (to deliberate) - If no language is given in the query (or none of the given languages are supported; Postgres supports fewer languages than our language filters), use a classical search without the search vector. This can be problematic if users usually have no recommendations language filter. Possible solutions include searching across every possible language even if it is slower (~120 ms instead of 30 ms), picking a few languages by default, or creating a 2nd SearchVectorField with a language-independent config... This may be worth a short discussion on the voice chat.
This question has been discussed during the weekly meeting. Here is the suggested solution when no language filter is defined:
- build a SearchQuery with at most 2 languages: English + the UI language (which can be accessed via Django's request.LANGUAGE_CODE, based on the Accept-Language header)
- in this case, no explicit language filter is applied in the database query: the results may be in any language.
@glerzing Does it seem a reasonable solution to you? Or should we expect irrelevant results when mixing different language configurations in the SearchQuery and the SearchVector being queried?
14 (to deliberate) - Shouldn't the frontend somehow automatically remove the date filter "A month ago" to return more results? This is especially important because people (including the past me) may not realize that they get very few results because of this filter.
This is a separate issue, which may influence the behavior of the filters in other contexts. A new issue #1042 has been created (thanks @GresilleSiffle)
Responses to @GresilleSiffle:
Do you think it would be a good idea, at a later stage, to add a few specific dictionaries? Or would this bias the search too much?
It shouldn't be too hard to add a dictionary of synonyms to the Postgres config. The difficulty is more about finding or maintaining the dictionaries, especially if they need to cover multiple languages. It depends on the balance between usefulness and effort. What examples of words do you have in mind?
Alright. When you say building every search vector, are you talking about building them for all entities in the database?
Yes, during the migration.
Is it the search vector of a single entity that is long to build, or all of them?
Rebuilding the search vectors of all entities (in the dev-env environment, in which the database is reduced) took between 15 and 20 seconds (=> 2.7 ms per entity). Refreshing the metadata of 50 entities took 5.2 seconds in my environment without rebuilding the search vectors; with the search vectors refreshed as well, it took 6.5 seconds, so 25% more (=> 26 ms per entity).
Responses to @amatissart:
build a SearchQuery with at most 2 languages: English + the UI language (can be accessed via Django's request.LANGUAGE_CODE, based on the Accept-Language header); in this case no explicit language filter is applied in the database query: the results may be in any language.
I'm not sure I understand your proposition. When @GresilleSiffle explained it to me, I understood: if there is a search for keywords and no language filter, it will automatically add English ("en") and the user's language. That looked good to me. I'm not sure if it's better to do it in the backend or in the frontend. Besides, I guess that it means that I can remove search_without_vector_field? Or will it still be used when there is a language filter with only languages unsupported by Postgres?
in this case no explicit language filter is applied in the database query: the results may be in any language. @glerzing Does it seem a reasonable solution to you? Or should we expect irrelevant results when mixing different language configurations in the SearchQuery and the SearchVector being queried?
You could have a language-independent search vector (but of worse quality in terms of stop words and stemming). But as I did it, every entity has its own language config. If you make a SearchQuery("word", config="en") | SearchQuery("word", config="fr"), it will sequentially take the "en" config, process the query (stemming, stop words...) and search among the entities whose search_vector was built with the same "en" config. Then, it will do the same for "fr". So, if you want to make a search independently of the language, you have to combine (with "|") a SearchQuery for every language configuration. The results will not be irrelevant, it will just be slower (~120 ms instead of 30 ms).
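In code, the language-independent variant would look something like this (the text and the list of configs are illustrative, and the hypothetical Entity model from point 8 is assumed):

```python
from django.contrib.postgres.search import SearchQuery

text = "climate change"
# One SearchQuery per supported Postgres config, OR-ed together: each sub-query is
# stemmed with its own config and matched against vectors built with that config.
configs = ["english", "french", "german"]
search_query = SearchQuery(text, config=configs[0])
for config in configs[1:]:
    search_query = search_query | SearchQuery(text, config=config)

queryset = Entity.objects.filter(search_vector=search_query)
```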
Thank you @glerzing for having addressed all my previous comments :ok_hand: