linkcheck performance: downloading page multiple times when checking anchors
Problem
- If my Sphinx documentation contains multiple anchored links pointing at the same web page, the link checker downloads that page multiple times, once per anchor to check
- This scales very badly: with many hundreds or thousands of anchored links (e.g. for automatically generated documentation), the checker might download several megabytes × the number of links, which can add up to multiple gigabytes
Procedure to reproduce the problem
- create a document with links to anchors on the same web page
- run the link checker; it will fetch the page multiple times (a hypothetical scripted reproduction is sketched below)
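For a scripted reproduction, something along these lines could be used. This is a hypothetical helper, not part of the project above; the anchor names are placeholders, so substitute anchors that actually exist on the target page. Watching the HTTP traffic (or the target server's access log) while it runs shows the page being fetched once per anchored link:

```python
# Hypothetical reproduction script: generates a tiny Sphinx project whose
# only document links to the same page with several different anchors,
# then runs the linkcheck builder.
import subprocess
import tempfile
from pathlib import Path

# Placeholder target and anchors; substitute anchors that really exist
# on the page you want to test against.
PAGE = "https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html"
ANCHORS = [f"anchor-{i}" for i in range(5)]

srcdir = Path(tempfile.mkdtemp())
(srcdir / "conf.py").write_text("project = 'linkcheck-repro'\n")
(srcdir / "index.rst").write_text(
    "Repro\n=====\n\n"
    + "\n".join(f"- `link {i} <{PAGE}#{a}>`_" for i, a in enumerate(ANCHORS))
    + "\n"
)

# Exit status may be non-zero because the placeholder anchors will not be
# found; the interesting part is the number of HTTP requests made.
subprocess.run(["sphinx-build", "-b", "linkcheck", str(srcdir), str(srcdir / "_build")])
```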
Expected results
- I would suggest that the link checker cache the anchors on web pages, so that it downloads each page only once and checks each link only once. It could build a dictionary of pages to check, storing the anchors for each page as a list or dict within it (a rough sketch follows this list). Since we know up front which of our links have anchors, we can skip storing anchors for pages where we know it's unnecessary.
- There may be other better ways of doing this; I'm not familiar with the internals of the link checker.
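As a concrete illustration of that suggestion, here is a minimal sketch of grouping hyperlinks by page. It is not Sphinx's actual data model, and the example URLs are placeholders; the point is only that each page would be fetched and parsed at most once:

```python
# Minimal sketch (assumption, not the Sphinx implementation): group
# hyperlinks by the page they point at, remembering which anchors each
# page needs, so every page is downloaded and parsed at most once.
from urllib.parse import urldefrag

hyperlinks = [
    "http://localhost/foo#main-heading",
    "http://localhost/foo",
    "http://localhost/foo#second-heading",
    "http://localhost/bar",
]

# page URL -> set of anchors that must exist on that page
anchors_by_page: dict[str, set[str]] = {}
for link in hyperlinks:
    page, anchor = urldefrag(link)
    anchors_by_page.setdefault(page, set())
    if anchor:  # pages referenced without an anchor need no anchor check
        anchors_by_page[page].add(anchor)

for page, anchors in anchors_by_page.items():
    # one GET per page; all of its anchors are verified from that response
    print(page, sorted(anchors))
```

With a structure like this, the checker would issue one request per key and verify every recorded anchor against the same response body.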
Reproducible project / your project
- https://github.com/openmicroscopy/bioformats/tree/develop/docs/sphinx
- contains lots of links to https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html
Environment info
- OS: Any
- Python version: Any
- Sphinx version: Any
cc @jayaddison
I'm taking a break from development here for a little while, but I had begun investigating this and considered an approximate design for how to resolve it:
- We have a `ConnectionMeasurement` helper in our `linkcheck` tests that inspects the number of HTTP connections made when unit tests run. Following test-driven development, we could lower the target value for that, and then fixing this bug should help to confirm the results (not the only test coverage required, but a useful confirmation).
- The `contains_anchor` function within the `linkcheck` builder is fairly key to fixing this. Ideally it should be adapted, changed or inverted somehow so that we can check for multiple anchors within a document while only parsing the document once. Parsing should stop as soon as all anchors are found, the same way that it currently stops when it checks for a single anchor and finds it. (A rough sketch of one possible adaptation follows this list.)
- There may need to be some refactoring or clever logic to handle the fact that `N` source hyperlinks will distill down to slightly-fewer-than-`N` URIs to retrieve when some of them share anchor targets (`http://localhost/foo#main-heading`, `http://localhost/foo` and `http://localhost/foo#second-heading`: 3 hyperlinks, but only 1 URI to check). Currently our data structures and end condition for the check worker thread assume a 1-1 mapping between source hyperlinks and reported results.
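To make the `contains_anchor` point more concrete, here is a rough sketch of a multi-anchor check using the standard-library `HTMLParser`. This is not the existing Sphinx code; the class and function names are made up for illustration, and it assumes the response body can be consumed as streamed text chunks:

```python
# Rough sketch (assumption, not the current Sphinx implementation):
# collect every requested anchor in a single pass over the document and
# stop feeding the parser as soon as all of them have been seen.
from html.parser import HTMLParser


class MultiAnchorParser(HTMLParser):
    """Track which of the wanted anchors (id= or name= values) appear."""

    def __init__(self, wanted: set[str]) -> None:
        super().__init__()
        self.pending = set(wanted)
        self.found: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value in self.pending:
                self.pending.discard(value)
                self.found.add(value)


def find_anchors(chunks, wanted: set[str]) -> set[str]:
    """Feed streamed HTML chunks until every wanted anchor is found."""
    parser = MultiAnchorParser(wanted)
    for chunk in chunks:
        parser.feed(chunk)
        if not parser.pending:  # early exit: all anchors located
            break
    return parser.found
```

Anchors still pending when the stream ends would be reported as broken, which mirrors how a single missing anchor is reported today.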
I hope to pick this up again in a few weeks' time but wanted to leave some notes for anyone learning about it and/or working on it.
There are a few more steps required before this can be closed, @AA-Turner: #12206 is one step closer, but doesn't yet resolve this.