sphinx HTML Search: omit anchor reference from document titles in the search index.

Feature or Bugfix

Bugfix

Purpose

Adjusts the contents of the search-index provided to the client so that the browser correctly de-duplicates search results that link to the top-level title in documents.

Detail

Adapts the transformation of docutils nodes.title nodes into the title index by omitting their id that is used as the hyperlink anchor in the special-case of the document title.
Note: there is no explicit markup for the document title in reStructuredText.

Relates

Resolves #11961.
Alternative approach to client-side fix in #11942.

Mar 04 '24 11:03 jayaddison

This seems preferable to my solution in #11942, just tested and it has the same results while making the search index smaller. :+1: from me.

Mar 09 '24 15:03 wlach

Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ? @jayaddison ping me when you want it to be merged

Mar 15 '24 14:03 picnixz

Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ?

That's correct. I think there are three things that I don't like about this pull request, in order of priority:

It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
I haven't been able to think of a way that this would happen yet, but in theory I expect some downstream tools that use the current index format could be broken by this - in particular the possibility of null values where previously everything was a string.
Related to (2), the index format isn't as small as it could be. In raw text, the JavaScript empty string can be represented by either '' or "" (two characters), whereas a null value is represented by the token null (four characters). It may not matter a lot, but over a long enough duration of time the cumulative cost could be significant (or not! maybe it's a waste of time considering it).

Mar 15 '24 15:03 jayaddison

It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.

I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add.

Mar 15 '24 16:03 jayaddison

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

Mar 15 '24 16:03 picnixz

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine.

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

Mar 15 '24 16:03 jayaddison

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

Moved into #12099.

Mar 15 '24 16:03 jayaddison

I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that).

Mar 16 '24 11:03 jayaddison

(and maybe do the null to '' refactoring at the same time and more safely, given the test coverage)

Mar 16 '24 11:03 jayaddison

Removing the "awaiting review" label until this PR is ready

Mar 17 '24 10:03 picnixz

Removing the "awaiting review" label until this PR is ready

Oops, thanks. I forgot about that.

Mar 17 '24 11:03 jayaddison

While I think it would be sensible to merge #12102 first (if-and-when that pull request is considered acceptable!), I'm going to remove the blocker label and draft status from this pull request.

May 09 '24 22:05 jayaddison

@picnixz I think this one is now ready for merge too - glad to have #12102 already-merged as preparation.

Jun 13 '24 16:06 jayaddison

Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the BuildEnvironment class. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus project build caches?).

Jun 14 '24 18:06 jayaddison

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

Jun 14 '24 19:06 jayaddison

Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the BuildEnvironment class. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus projechttps://docs.github.com/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branchest build caches?).

Hmm, my understanding is that under normal circumstances type changes do not affect runtime behaviour at all.

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

Jun 15 '24 17:06 wlach

Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the BuildEnvironment class. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus projechttps://docs.github.com/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branchest build caches?).

Hmm, my understanding is that under normal circumstances type changes do not affect runtime behaviour at all.

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.

However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

Basically, in both cases there is some element of possible stateful data that downstream projects may have written to disk, and so I want to be careful to make sure that the changes we introduce are at-least forwards compatible (deploying an updated version of Sphinx and rebuilding a project does not fail), and perhaps even backwards compatible where possible (rolling back to a previous version of Sphinx after a rebuild with a newer version would also succeed).

Jun 18 '24 09:06 jayaddison

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.

However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302

Jun 18 '24 11:06 wlach

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa623f77140d8a96e7e29514a4f75cf1a64

However, it's not clear to me whether those loaded entries are retained before the updated search index is written, because of the code linked below; is env.all_docs the entire project (in which case it's definitely not an incremental rebuild), or are they only the modified documents?

https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/builders/html/init.py#L1163

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target. However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61

Jun 18 '24 13:06 jayaddison

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

Jun 18 '24 14:06 wlach

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content. Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

Ok - that matches my understanding too; the only time we write to the search-related attributes on BuildEnvironment is during loading of an existing searchindex.js from the output build directory. Subsequently some/all of that in-memory content is invalidated before we write the updated searchindex.js to that same output build directory.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

Given that we believe that the Python-side changes should be backwards-compatible here (no structural changes to the pickled env data stored-and-read by Sphinx), I think I'd prefer to leave the env-version value as-is.

Jun 18 '24 15:06 jayaddison

NodeJS tests on this branch began failing from 9fef43ba3ed3c5cb09a264994b55dc1d124e7810 onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43ba3ed3c5cb09a264994b55dc1d124e7810 regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

Jun 24 '24 18:06 jayaddison

Good news, I think.

NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

This index-regeneration occurred directly after e947f6b2e4b1023b45679816167fbed28067177c - a merge from the mainline branch. That merge pulled in support for JavaScript testing of client-side HTML. And that happened to include a test case -- with a relevant TODO note -- indicating that some of the data in it is a workaround and should be removed when #11961 is fixed. This PR fixes #11961 - so it is in fact expected that we should remove that data, and it's good that the test is failing while it is still in place.

Edit: remove duplication of quoted comment

Jun 24 '24 18:06 jayaddison

@jayaddison Is there anything left to be done here or is this ready for a review by a committer?

Jul 05 '24 17:07 wlach

This is ready, yep; the mypy lint failure on the latest commit (64387c4ae9a10cbd6851a4a693b21b00d369511b) is unrelated (#12510), and my last two messages where I'm talking to myself (not uncommon, sadly) are because I hadn't updated the test coverage after merging from mainline (to remove a duplicate result expectation added for a separate, already-merged PR).

Jul 05 '24 17:07 jayaddison

I'll review it tomorrow as well! (now that I've successfully defended, I have a bit more time!)

Jul 05 '24 18:07 picnixz

Well done completing the defence @picnixz :) Thank you!

Jul 05 '24 18:07 jayaddison

I'll do it tomorrow (I was working on something else and missed the notification but I'm going offline now and I prefer not to merge something that I won't be able to see the merge result of).

Jul 07 '24 16:07 picnixz

Thank you Jay (and Will for the review!)

Jul 08 '24 11:07 picnixz

Thank you @picnixz!

Jul 08 '24 12:07 jayaddison

sphinx sphinx copied to clipboard

HTML Search: omit anchor reference from document titles in the search index.

Feature or Bugfix

Purpose

Detail

Relates

sphinx
sphinx copied to clipboard