sphinx icon indicating copy to clipboard operation
sphinx copied to clipboard

HTML Search: Fix duplicate results

Open wlach opened this issue 1 year ago • 8 comments

Subject: Fix duplicate search results

Feature or Bugfix

  • Bugfix

Purpose

We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.

Fix this by not using the anchor in the search result if it's an exact match on the document title. This ensures that each search result will appear at most one time.

Detail

An easy way to reproduce the bug is to do a title search: the first result will link to the section, the second will link to the page in general. It appears that the "remove duplicates" feature was added some time ago, perhaps before there were multiple ways in which documents were searched.

https://pradyunsg.me/furo/search/?q=quickstart&check_keywords=yes&area=default

Screenshot from 2024-02-04 13-42-21

Note that it appears the main sphinx documentation uses readthedocs' custom search, so this bug wouldn't be reproducible there.

Relates

Closes #11961

wlach avatar Feb 04 '24 18:02 wlach

@AA-Turner Sorry for the direct ping again, but here's another hopefully obvious search fix in the vein of #11770

wlach avatar Feb 04 '24 18:02 wlach

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

wlach avatar Feb 05 '24 15:02 wlach

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

About the test case: it's OK -- in fact much preferred! -- to invert the logic in that so that it fails until your fix from #11942 is applied to it. That helps people to understand the nature of the bug, and demonstrates how the fix addresses it.

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:

  1. Ensure that each bug has a description somewhere (I count four you've uncovered, with one issue opened so far).
  2. Open a refactor PR to make the Search.query function testable - based on your work so far that seems to be a requirement. That PR shouldn't change any behaviour, and it (hopefully!) should be uncontroversial and easy for us to review and accept.
  3. Open either a combined PR, or one PR to resolve each bug (as you prefer -- as a reviewer, I'd like one per bug, but am OK with either). There should be at least one test case corresponding to each bug, and they should fail without the fix, and succeed with the fix in place.

I'm more-or-less up-to-date with your changes and I think they'll make a big improvement to Sphinx search quality -- all the more reason to get them included and credited in the changelog. Even so, I think total-time-to-merge may be minimized if we spend a little more time up-front to describe and separate each bugfix.

Let me know if I can help to explain anything about my recommended bugfix workflow or the reasoning behind it. There could also be things in the contribution documentation for us to update (we're not strict about bugreports before patches, but I do think they'll be helpful here).

jayaddison avatar Feb 05 '24 17:02 jayaddison

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while: ...

Hi yeah, that more or less makes sense. I didn't realize this would be quite the rabbit hole it was when I filed this initial PR otherwise I would have been a bit more systematic in my approach. I'll do something approximately like what you suggest (I think I might not bother with the seperate refactoring PR). One of the issues is that this issue kind of depends on the other two I found yesterday being fixed to be reasonably testable.

wlach avatar Feb 06 '24 13:02 wlach

@jayaddison Hey, revised this PR to reference a specific issue as you suggested. It includes changes from other PRs for ease-of-testing, but that should go away once those are merged. Hopefully those are uncontroversial.

wlach avatar Feb 06 '24 15:02 wlach

Update: @jayaddison and I had a back and forth about different approaches in https://github.com/wlach/sphinx/pull/1.

After all that discussion, I feel pretty ok about the approach in this PR and feel like it's reasonable to land as-is (as stated in the header, https://github.com/sphinx-doc/sphinx/pull/11960 and https://github.com/sphinx-doc/sphinx/pull/11958 should go in first, then this can be rebased on top of them) though I'm still willing to make minor adjustments if necessary.

wlach avatar Feb 27 '24 14:02 wlach

Ok, I rebased this on top of all the work that landed so I think this is ready for review/landing on its own now.

In addition to the unit test, I also tested this end-to-end by running make docs target=html after installing Sphinx locally in a virtualenv.

Example output before:

image

(note double "Welcome")

Example output after:

image

(note only one "Welcome" result, pointing at the header)

wlach avatar Mar 03 '24 19:03 wlach

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

image

current branch

image

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

jayaddison avatar Mar 04 '24 01:03 jayaddison

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

[trimmed]

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

Well, it turns out that this was an unfair baseline to compare against.

The sphinx-doc.org site seems to be hosted by readthedocs, and the relevant Sphinx extension adds some custom code to use their search API in preference to Sphinx's own search index.

Running a search for configuration from the latest current Sphinx commit at the time of writing here (ae51974e217cc7590e1b9c68f18af3e043027c6d) produces:

image

...so basically identical results, but including many duplicates.

jayaddison avatar Mar 05 '24 15:03 jayaddison

Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.

I agree FWIW. My preference would be to merge your solution and close this. Will wait for a maintainer to weigh in though.

wlach avatar Mar 09 '24 15:03 wlach

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

picnixz avatar Mar 17 '24 09:03 picnixz

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

Ok, I'm just going to close this to reduce confusion. The code is still here if we need to try something else later, but that seems pretty unlikely at this point.

wlach avatar Mar 18 '24 01:03 wlach