Documentation search results relevance improvements
Searching for "through" finds nothing:
This search should link to at least these:
https://docs.djangoproject.com/en/3.2/topics/db/models/#extra-fields-on-many-to-many-relationships
https://docs.djangoproject.com/en/3.2/ref/models/fields/#django.db.models.ManyToManyField.through
In English, the word "through" is a stopword and is ignored in the search against the English dictionary used in PostgreSQL.
From the PostgreSQL documentation:
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index.
The word "through" is not a stopword in other dictionaries, for example the Italian one, and the same search against the Italian documentation does return results:
https://docs.djangoproject.com/it/3.2/search/?q=through
I figured it might be something like that. Framework function names and stuff should bypass that logic somehow.
@boxed I don't think "through" is the only stopword that matters in search. Perhaps it would be useful to have a list of these words and then think of a way to ensure that they are not discarded. Could you write a list of words that should not be dropped, starting from the official PostgreSQL stopwords list? https://github.com/postgres/postgres/blob/master/src/backend/snowball/stopwords/english.stop
Hm, I don't know about a complete list. But "where" is certainly suspicious, as it's a keyword in SQL. This gets a bit tricky, as "where" should probably only be searchable when it appears in a code block, such as a select statement. I also doubt "now" should be excluded, as it could be something you want to search for, like datetime.now, which I guess the current implementation just interprets as "datetime".
That's what I could find reading through this list. I think one could imagine a solution where, if the search returns no hits, it's re-run without filtering out stopwords. This would fix the worst case at least.
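A rough sketch of that fallback idea, using a toy in-memory index rather than the real PostgreSQL-backed search. The `DOCS` entries, the `search` helper, and `search_with_fallback` are all made up for illustration, and `STOPWORDS` is only a small excerpt of english.stop:

```python
# Small excerpt of PostgreSQL's english.stop, for illustration only.
STOPWORDS = frozenset({"a", "the", "by", "through", "where", "now"})

# Hypothetical stand-in for the indexed documentation: title -> body text.
DOCS = {
    "ManyToManyField.through": "through model for extra fields on many to many relationships",
    "QuerySet.filter": "returns a new queryset filtered by the given lookups",
}


def search(query, drop_stopwords):
    """Return titles of documents containing every query term as a whole token."""
    terms = [t.lower() for t in query.split()]
    if drop_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if not terms:
        # Every term was a stopword, so there is nothing left to match on.
        return []
    hits = []
    for title, body in DOCS.items():
        tokens = f"{title} {body}".lower().split()
        if all(t in tokens for t in terms):
            hits.append(title)
    return hits


def search_with_fallback(query):
    """Search with stopwords dropped first; if nothing is found, retry keeping them."""
    results = search(query, drop_stopwords=True)
    if results:
        return results
    return search(query, drop_stopwords=False)


print(search_with_fallback("through"))
```

The second pass only runs when the first one comes back empty, so queries that already work would keep their current results.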
We can try to create a custom English dictionary that does not treat words relevant to Django as stopwords.
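A minimal sketch of how the stop list for such a dictionary could be derived, assuming we start from english.stop and remove the words we want to keep searchable. PostgreSQL stop files hold one word per line, and the snowball dictionary template takes a StopWords parameter naming the file. The names `ENGLISH_STOP_EXCERPT`, `KEEP_SEARCHABLE`, and `build_stop_file` are hypothetical, and the keep list only contains the words mentioned in this thread:

```python
# Excerpt of english.stop for illustration; the real file has many more entries.
ENGLISH_STOP_EXCERPT = ["a", "the", "by", "through", "where", "now", "and"]

# Hypothetical keep list: only the candidates named in this thread so far.
KEEP_SEARCHABLE = {"through", "where", "now"}


def build_stop_file(stopwords, keep):
    """Return stop file contents (one word per line) without the kept words."""
    kept = {word.lower() for word in keep}
    remaining = [word for word in stopwords if word.lower() not in kept]
    return "\n".join(remaining) + "\n"


print(build_stop_file(ENGLISH_STOP_EXCERPT, KEEP_SEARCHABLE))
```

The generated contents would still need to be installed as a stop file on the database server and referenced from a custom text search dictionary and configuration, which this sketch does not cover.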
Noting the issue with stopwords – and also from #1496, we got the following recommendation:
Optimize the Documentation Search Algorithm: Evaluate the improvement of the internal search algorithm to provide more accurate and relevant results in response to user queries.
I’ve re-titled the issue accordingly so we consider more improvements than just stopwords refinements.
Related: Site-wide search #1499.
Considering this simple and limited scope change has seen no improvement in several years, I don't think broadening the scope of the issue is a good idea.
Talking about this issue not moving forward... Could we maybe consider building something simple in front of the current code that does very simple string matching on just the titles in the documentation and shows those matches first? Maybe other hard-coded searches could be added too, since for example searching "group by" shows nothing of relevance.
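Something along these lines, as a sketch only. `title_matches` and `search_titles_first` are hypothetical helpers, `full_text_search` stands in for whatever the current search code exposes, and the titles in the usage example are invented:

```python
def title_matches(query, titles):
    """Case-insensitive substring match of the whole query against each title."""
    needle = query.strip().casefold()
    if not needle:
        return []
    return [title for title in titles if needle in title.casefold()]


def search_titles_first(query, titles, full_text_search):
    """Put title matches first, then the existing results minus duplicates."""
    first = title_matches(query, titles)
    seen = set(first)
    rest = [hit for hit in full_text_search(query) if hit not in seen]
    return first + rest


# Invented titles and a fake full-text backend, for illustration only.
titles = ["Making queries", "Aggregation and group by", "Models"]
fake_full_text_search = lambda query: ["Models", "Aggregation and group by"]
print(search_titles_first("group by", titles, fake_full_text_search))
```

Because the title match is a plain substring test, it never goes through the stopword filtering, so a query like "group by" can still hit a title containing it.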
If anyone really wants to fix the issue with stopwords only – that’s still as welcome as it was until now.
This is a volunteer-run project, and this hasn’t been picked up in three years of it being defined as quite a narrow improvement. I think putting this in the broader context of search improvements will make it clearer to potential contributors what the goal here is. Personally what I’d like to see is a more strategic approach to this where we look at analytics on what searches are being made that have 0 results.
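To make the analytics idea concrete, here is a small sketch of the kind of report I have in mind. It assumes we could export searches as (query, result count) pairs; that log format and the `zero_result_queries` helper are hypothetical, and the sample data is invented:

```python
from collections import Counter


def zero_result_queries(log, top=10):
    """Count normalized queries that returned no results, most frequent first."""
    counts = Counter(
        query.strip().lower() for query, result_count in log if result_count == 0
    )
    return counts.most_common(top)


# Invented sample data: (query, number of results returned).
sample_log = [
    ("through", 0),
    ("Through ", 0),
    ("queryset", 12),
    ("group by", 0),
]
print(zero_result_queries(sample_log))
```

A ranking like this would show which failing searches are worth fixing first, whether the cause is stopwords or something else.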
I don’t like the idea of hard-coded searches as we simply don’t have the capacity to maintain that kind of content. I’d rather we set up boosting based on headings (if that’s not already the case).
I agree on the statistics being very useful.