sphinx
sphinx copied to clipboard
HTML Search: omit anchor reference from document titles in the search index.
Feature or Bugfix
- Bugfix
Purpose
- Adjusts the contents of the search-index provided to the client so that the browser correctly de-duplicates search results that link to the top-level title in documents.
Detail
- Adapts the transformation of
docutilsnodes.titlenodes into the title index by omitting theiridthat is used as the hyperlink anchor in the special-case of the document title. - Note: there is no explicit markup for the document title in reStructuredText.
Relates
- Resolves #11961.
- Alternative approach to client-side fix in #11942.
This seems preferable to my solution in #11942, just tested and it has the same results while making the search index smaller. :+1: from me.
Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ? @jayaddison ping me when you want it to be merged
Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ?
That's correct. I think there are three things that I don't like about this pull request, in order of priority:
- It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
- I haven't been able to think of a way that this would happen yet, but in theory I expect some downstream tools that use the current index format could be broken by this - in particular the possibility of
nullvalues where previously everything was a string. - Related to (2), the index format isn't as small as it could be. In raw text, the JavaScript empty string can be represented by either
''or""(two characters), whereas a null value is represented by the tokennull(four characters). It may not matter a lot, but over a long enough duration of time the cumulative cost could be significant (or not! maybe it's a waste of time considering it).
It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add.
particular the possibility of null values where previously everything was a string.
Is it possible to keep a string or is the null type needed?
particular the possibility of null values where previously everything was a string.
Is it possible to keep a string or is the null type needed?
If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine.
What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.
What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds
searchindex.jswhen projects are built. It feels fragile that the test indexes are written by hand.
Moved into #12099.
I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that).
(and maybe do the null to '' refactoring at the same time and more safely, given the test coverage)
Removing the "awaiting review" label until this PR is ready
Removing the "awaiting review" label until this PR is ready
Oops, thanks. I forgot about that.
While I think it would be sensible to merge #12102 first (if-and-when that pull request is considered acceptable!), I'm going to remove the blocker label and draft status from this pull request.
@picnixz I think this one is now ready for merge too - glad to have #12102 already-merged as preparation.
Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the BuildEnvironment class. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus project build caches?).
(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)
Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the
BuildEnvironmentclass. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus projechttps://docs.github.com/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branchest build caches?).
Hmm, my understanding is that under normal circumstances type changes do not affect runtime behaviour at all.
(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)
IMO it's fine. We're just changing what gets included in the searchindex.
Hmm, the type-linting error highlighted by that adjustment is not entirely trivial; it bubbles up to the
BuildEnvironmentclass. I'll fix it, but I seem to remember that environment changes can affect pickling (and thus projechttps://docs.github.com/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branchest build caches?).Hmm, my understanding is that under normal circumstances type changes do not affect runtime behaviour at all.
Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.
(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)
IMO it's fine. We're just changing what gets included in the searchindex.
I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.
However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.
Basically, in both cases there is some element of possible stateful data that downstream projects may have written to disk, and so I want to be careful to make sure that the changes we introduce are at-least forwards compatible (deploying an updated version of Sphinx and rebuilding a project does not fail), and perhaps even backwards compatible where possible (rolling back to a previous version of Sphinx after a rebuild with a newer version would also succeed).
Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (
Nonenow permitted) - and since theenvis serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.
I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code
(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)
IMO it's fine. We're just changing what gets included in the searchindex.
I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.
However, similar to the
envpickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.
It looks like we explicitly reject any index created with a previous version of Sphinx in this case:
https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302
Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (
Nonenow permitted) - and since theenvis serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code
I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.
Ref: b1271fa623f77140d8a96e7e29514a4f75cf1a64
However, it's not clear to me whether those loaded entries are retained before the updated search index is written, because of the code linked below; is env.all_docs the entire project (in which case it's definitely not an incremental rebuild), or are they only the modified documents?
https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/builders/html/init.py#L1163
(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)
IMO it's fine. We're just changing what gets included in the searchindex.
I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target. However, similar to the
envpickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.It looks like we explicitly reject any index created with a previous version of Sphinx in this case:
https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302
Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:
https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61
I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code
I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.
Ref: b1271fa
Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.
It looks like we explicitly reject any index created with a previous version of Sphinx in this case: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302
Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:
https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61
You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.
I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code
I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content. Ref: b1271fa
Ok, but I don't see anything in the codebase accessing that variable (
self._titles) aside from when we write per that search query above. Let me know if I'm missing something.
Ok - that matches my understanding too; the only time we write to the search-related attributes on BuildEnvironment is during loading of an existing searchindex.js from the output build directory. Subsequently some/all of that in-memory content is invalidated before we write the updated searchindex.js to that same output build directory.
It looks like we explicitly reject any index created with a previous version of Sphinx in this case: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/search/init.py#L302
Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run: https://github.com/sphinx-doc/sphinx/blob/de4ac2fbdebae529a973910e96794645158f0401/sphinx/environment/init.py#L59-L61
You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.
Given that we believe that the Python-side changes should be backwards-compatible here (no structural changes to the pickled env data stored-and-read by Sphinx), I think I'd prefer to leave the env-version value as-is.
NodeJS tests on this branch began failing from 9fef43ba3ed3c5cb09a264994b55dc1d124e7810 onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).
The change at 9fef43ba3ed3c5cb09a264994b55dc1d124e7810 regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.
Good news, I think.
NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).
The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.
This index-regeneration occurred directly after e947f6b2e4b1023b45679816167fbed28067177c - a merge from the mainline branch. That merge pulled in support for JavaScript testing of client-side HTML. And that happened to include a test case -- with a relevant TODO note -- indicating that some of the data in it is a workaround and should be removed when #11961 is fixed. This PR fixes #11961 - so it is in fact expected that we should remove that data, and it's good that the test is failing while it is still in place.
Edit: remove duplication of quoted comment
@jayaddison Is there anything left to be done here or is this ready for a review by a committer?
This is ready, yep; the mypy lint failure on the latest commit (64387c4ae9a10cbd6851a4a693b21b00d369511b) is unrelated (#12510), and my last two messages where I'm talking to myself (not uncommon, sadly) are because I hadn't updated the test coverage after merging from mainline (to remove a duplicate result expectation added for a separate, already-merged PR).
I'll review it tomorrow as well! (now that I've successfully defended, I have a bit more time!)
Well done completing the defence @picnixz :) Thank you!
I'll do it tomorrow (I was working on something else and missed the notification but I'm going offline now and I prefer not to merge something that I won't be able to see the merge result of).
Thank you Jay (and Will for the review!)
Thank you @picnixz!