sphinx
Fix linkcheck anchor encoding issue (#13620)
Description
This PR fixes an issue where the linkcheck builder incorrectly reports "Anchor not found" errors for URLs with encoded characters in fragment identifiers (anchors), despite these URLs working correctly in web browsers.
Current Behavior
When encountering a URL with percent-encoded characters in the anchor/fragment (e.g., https://example.com/page#standard-input%2Foutput-stdio), the linkcheck builder:
- Extracts the fragment: standard-input%2Foutput-stdio
- Decodes it to: standard-input/output-stdio
- Searches for an HTML element with id="standard-input/output-stdio" or name="standard-input/output-stdio"
- Reports a broken link when the element isn't found, even though the URL works in browsers
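The failure mode described above can be reproduced with a minimal sketch using only the standard library. This is not Sphinx's actual code: SingleAnchorFinder and old_style_check are hypothetical names that mimic the decode-then-search behavior, and the HTML snippets are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class SingleAnchorFinder(HTMLParser):
    """Hypothetical stand-in for the pre-fix check: searches for one anchor only."""

    def __init__(self, anchor: str) -> None:
        super().__init__()
        self.anchor = anchor
        self.found = False

    def handle_starttag(self, tag, attrs):
        # An anchor target is any element whose id or name equals the anchor.
        for key, value in attrs:
            if key in ("id", "name") and value == self.anchor:
                self.found = True


def old_style_check(url: str, html: str) -> bool:
    """Decode the fragment, then look only for the decoded form."""
    _, fragment = urldefrag(url)   # fragment is returned still percent-encoded
    decoded = unquote(fragment)    # "%2F" becomes "/"
    finder = SingleAnchorFinder(decoded)
    finder.feed(html)
    return finder.found


URL = "https://example.com/page#standard-input%2Foutput-stdio"
# Assumed page markup: the id attribute keeps the percent-encoded text.
ENCODED_ID_HTML = '<h2 id="standard-input%2Foutput-stdio">stdio</h2>'
DECODED_ID_HTML = '<h2 id="standard-input/output-stdio">stdio</h2>'

print(old_style_check(URL, ENCODED_ID_HTML))  # decoded form is absent from the page
print(old_style_check(URL, DECODED_ID_HTML))  # decoded form is present
```

When the page keeps the percent-encoded text in its id attribute, searching only for the decoded form misses it, which matches the "Anchor not found" report.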
Changes Made
- Enhanced AnchorCheckParser to check for multiple variants of the anchor:
  - The decoded version (current behavior)
  - The original encoded version
  - A re-encoded version if the decoded version contains encoding-required characters
- Added comprehensive tests to verify the new behavior
- Updated the contains_anchor function to accept both decoded and original encoded anchors
- Added entry to CHANGES.rst
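The multi-variant idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: anchor_variants, VariantAnchorFinder, and contains_any_anchor are hypothetical names, and re-encoding with quote(..., safe="") is only one plausible reading of "re-encoded version".

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote


def anchor_variants(fragment: str) -> set[str]:
    """Return the candidate spellings of an anchor (hypothetical helper)."""
    decoded = unquote(fragment)
    return {
        decoded,                  # decoded version (the previous behavior)
        fragment,                 # original, as written in the URL
        quote(decoded, safe=""),  # re-encoded version (assumed encoding rules)
    }


class VariantAnchorFinder(HTMLParser):
    """Matches an element whose id or name equals any accepted variant."""

    def __init__(self, variants: set[str]) -> None:
        super().__init__()
        self.variants = variants
        self.found = False

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value in self.variants:
                self.found = True


def contains_any_anchor(html: str, fragment: str) -> bool:
    finder = VariantAnchorFinder(anchor_variants(fragment))
    finder.feed(html)
    return finder.found


FRAGMENT = "standard-input%2Foutput-stdio"
print(contains_any_anchor('<h2 id="standard-input%2Foutput-stdio">x</h2>', FRAGMENT))
print(contains_any_anchor('<a name="standard-input/output-stdio"></a>', FRAGMENT))
```

Collecting the variants into a set keeps the per-tag comparison to a single membership test and collapses duplicates when the original and re-encoded forms coincide.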
Testing Done
- Added unit tests for the AnchorCheckParser class
- Added integration tests with a mock HTTP server that serves HTML with encoded anchors
- Verified that all tests pass with the new implementation
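For readers unfamiliar with the mock-server approach, a standalone sketch of that style of test might look like the following. It does not use Sphinx's test fixtures; EncodedAnchorHandler, IdCollector, and run_check are hypothetical names, and the served page is invented for illustration.

```python
import threading
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote
from urllib.request import ProxyHandler, build_opener

# Assumed page content: the id attribute holds the percent-encoded anchor.
PAGE = b'<html><body><h2 id="standard-input%2Foutput-stdio">stdio</h2></body></html>'


class EncodedAnchorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep test output quiet


class IdCollector(HTMLParser):
    """Collects every id and name attribute value on the page."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value is not None:
                self.ids.add(value)


def run_check(fragment: str) -> bool:
    """Serve PAGE from a throwaway local server and look for the anchor."""
    server = HTTPServer(("127.0.0.1", 0), EncodedAnchorHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
        with opener.open(f"http://127.0.0.1:{port}/page", timeout=5) as response:
            html = response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    collector = IdCollector()
    collector.feed(html)
    # Accept either the original or the decoded spelling of the anchor.
    return bool({fragment, unquote(fragment)} & collector.ids)
```

Calling run_check("standard-input%2Foutput-stdio") exercises the encoded-anchor case end to end, while a fragment that does not appear on the page should still be reported as missing.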
Fixes
Fixes #13620
After some initial confusion (documented to some extent in the linked issue thread #13620), I'm now supportive of this functionality, and would like to see this merged. I would like a few adjustments/refactorings to be made before then, though.