sphinx
sphinx copied to clipboard
HTML Search: Fix duplicate results
Subject: Fix duplicate search results
Feature or Bugfix
- Bugfix
Purpose
We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.
Fix this by not using the anchor in the search result if it's an exact match on the document title. This ensures that each search result will appear at most one time.
Detail
An easy way to reproduce the bug is to do a title search: the first result will link to the section, the second will link to the page in general. It appears that the "remove duplicates" feature was added some time ago, perhaps before there were multiple ways in which documents were searched.
https://pradyunsg.me/furo/search/?q=quickstart&check_keywords=yes&area=default
Note that it appears the main sphinx documentation uses readthedocs' custom search, so this bug wouldn't be reproducible there.
Relates
Closes #11961
@AA-Turner Sorry for the direct ping again, but here's another hopefully obvious search fix in the vein of #11770
Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.
Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.
About the test case: it's OK -- in fact much preferred! -- to invert the logic in that so that it fails until your fix from #11942 is applied to it. That helps people to understand the nature of the bug, and demonstrates how the fix addresses it.
Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:
- Ensure that each bug has a description somewhere (I count four you've uncovered, with one issue opened so far).
- Open a refactor PR to make the
Search.queryfunction testable - based on your work so far that seems to be a requirement. That PR shouldn't change any behaviour, and it (hopefully!) should be uncontroversial and easy for us to review and accept. - Open either a combined PR, or one PR to resolve each bug (as you prefer -- as a reviewer, I'd like one per bug, but am OK with either). There should be at least one test case corresponding to each bug, and they should fail without the fix, and succeed with the fix in place.
I'm more-or-less up-to-date with your changes and I think they'll make a big improvement to Sphinx search quality -- all the more reason to get them included and credited in the changelog. Even so, I think total-time-to-merge may be minimized if we spend a little more time up-front to describe and separate each bugfix.
Let me know if I can help to explain anything about my recommended bugfix workflow or the reasoning behind it. There could also be things in the contribution documentation for us to update (we're not strict about bugreports before patches, but I do think they'll be helpful here).
Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while: ...
Hi yeah, that more or less makes sense. I didn't realize this would be quite the rabbit hole it was when I filed this initial PR otherwise I would have been a bit more systematic in my approach. I'll do something approximately like what you suggest (I think I might not bother with the seperate refactoring PR). One of the issues is that this issue kind of depends on the other two I found yesterday being fixed to be reasonably testable.
@jayaddison Hey, revised this PR to reference a specific issue as you suggested. It includes changes from other PRs for ease-of-testing, but that should go away once those are merged. Hopefully those are uncontroversial.
Update: @jayaddison and I had a back and forth about different approaches in https://github.com/wlach/sphinx/pull/1.
After all that discussion, I feel pretty ok about the approach in this PR and feel like it's reasonable to land as-is (as stated in the header, https://github.com/sphinx-doc/sphinx/pull/11960 and https://github.com/sphinx-doc/sphinx/pull/11958 should go in first, then this can be rebased on top of them) though I'm still willing to make minor adjustments if necessary.
Ok, I rebased this on top of all the work that landed so I think this is ready for review/landing on its own now.
In addition to the unit test, I also tested this end-to-end by running make docs target=html after installing Sphinx locally in a virtualenv.
Example output before:
(note double "Welcome")
Example output after:
(note only one "Welcome" result, pointing at the header)
@wlach could you try searching for configuration and check what kind of results you get?
When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org
sphinx-doc.org (baseline)
current branch
It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.
@wlach could you try searching for
configurationand check what kind of results you get?When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org
sphinx-doc.org (baseline)
[trimmed]
It's possible that this is a configuration setting related to the way that the
sphinx-docs.orgsite is deployed. Either way, it'd be nice to confirm.
Well, it turns out that this was an unfair baseline to compare against.
The sphinx-doc.org site seems to be hosted by readthedocs, and the relevant Sphinx extension adds some custom code to use their search API in preference to Sphinx's own search index.
Running a search for configuration from the latest current Sphinx commit at the time of writing here (ae51974e217cc7590e1b9c68f18af3e043027c6d) produces:
...so basically identical results, but including many duplicates.
Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.
I agree FWIW. My preference would be to merge your solution and close this. Will wait for a maintainer to weigh in though.
I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.
I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.
Ok, I'm just going to close this to reduce confusion. The code is still here if we need to try something else later, but that seems pretty unlikely at this point.