linkcheck performance: downloading page multiple times when checking anchors
Problem
- If my Sphinx documentation contains multiple anchored links pointing at the same web page, the link checker downloads that page multiple times, once per anchor to check
- This scales very badly: with many hundreds or thousands of anchored links (e.g. for automatically generated documentation), the checker might download several megabytes × the number of links, which can add up to multiple gigabytes
Procedure to reproduce the problem
- create a document with links to anchors on the same web page
- run the link checker; it will fetch the page multiple times (a hypothetical scripted reproduction is sketched below)
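For a scripted reproduction, something along these lines could be used. This is a hypothetical helper, not part of the project above; the anchor names are placeholders, so substitute anchors that actually exist on the target page. Watching the HTTP traffic (or the target server's access log) while it runs shows the page being fetched once per anchored link:

```python
# Hypothetical reproduction script: generates a tiny Sphinx project whose
# only document links to the same page with several different anchors,
# then runs the linkcheck builder.
import subprocess
import tempfile
from pathlib import Path

# Placeholder target and anchors; substitute anchors that really exist
# on the page you want to test against.
PAGE = "https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html"
ANCHORS = [f"anchor-{i}" for i in range(5)]

srcdir = Path(tempfile.mkdtemp())
(srcdir / "conf.py").write_text("project = 'linkcheck-repro'\n")
(srcdir / "index.rst").write_text(
    "Repro\n=====\n\n"
    + "\n".join(f"- `link {i} <{PAGE}#{a}>`_" for i, a in enumerate(ANCHORS))
    + "\n"
)

# Exit status may be non-zero because the placeholder anchors will not be
# found; the interesting part is the number of HTTP requests made.
subprocess.run(["sphinx-build", "-b", "linkcheck", str(srcdir), str(srcdir / "_build")])
```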
Expected results
- I would suggest that the link checker cache the anchors on web pages, so that it downloads each page only once and checks each link only once. It could build a dictionary of pages to check, storing the anchors for each page as a list or dict within it (a rough sketch follows this list). Since we know up front which of our links have anchors, we can skip storing anchors for pages where we know it's unnecessary.
- There may be other better ways of doing this; I'm not familiar with the internals of the link checker.
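As a concrete illustration of that suggestion, here is a minimal sketch of grouping hyperlinks by page. It is not Sphinx's actual data model, and the example URLs are placeholders; the point is only that each page would be fetched and parsed at most once:

```python
# Minimal sketch (assumption, not the Sphinx implementation): group
# hyperlinks by the page they point at, remembering which anchors each
# page needs, so every page is downloaded and parsed at most once.
from urllib.parse import urldefrag

hyperlinks = [
    "http://localhost/foo#main-heading",
    "http://localhost/foo",
    "http://localhost/foo#second-heading",
    "http://localhost/bar",
]

# page URL -> set of anchors that must exist on that page
anchors_by_page: dict[str, set[str]] = {}
for link in hyperlinks:
    page, anchor = urldefrag(link)
    anchors_by_page.setdefault(page, set())
    if anchor:  # pages referenced without an anchor need no anchor check
        anchors_by_page[page].add(anchor)

for page, anchors in anchors_by_page.items():
    # one GET per page; all of its anchors are verified from that response
    print(page, sorted(anchors))
```

With a structure like this, the checker would issue one request per key and verify every recorded anchor against the same response body.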
Reproducible project / your project
- https://github.com/openmicroscopy/bioformats/tree/develop/docs/sphinx
- contains lots of links to https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html
Environment info
- OS: Any
- Python version: Any
- Sphinx version: Any
cc @jayaddison
I'm taking a break from development here for a little while, but I had begun investigating this and considered an approximate design for how to resolve it:
- We have a `ConnectionMeasurement` helper in our `linkcheck` tests that inspects the number of HTTP connections made when unit tests run. Following test-driven development, we could lower the target value for that, and then fixing this bug should help to confirm the results (not the only test coverage required, but a useful confirmation).
- The `contains_anchor` function within the `linkcheck` builder is fairly key to fixing this. Ideally it should be adapted, changed or inverted somehow so that we can check for multiple anchors within a document while only parsing the document once. Parsing should stop as soon as all anchors are found, the same way that it currently stops when it checks for a single anchor and finds it. (A rough sketch of one possible adaptation follows this list.)
- There may need to be some refactoring or clever logic to handle the fact that `N` source hyperlinks will distill down to slightly-fewer-than-`N` URIs to retrieve when some of them share anchor targets (`http://localhost/foo#main-heading`, `http://localhost/foo` and `http://localhost/foo#second-heading`: 3 hyperlinks, but only 1 URI to check). Currently our data structures and end condition for the check worker thread assume a 1-1 mapping between source hyperlinks and reported results.
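To make the `contains_anchor` point more concrete, here is a rough sketch of a multi-anchor check using the standard-library `HTMLParser`. This is not the existing Sphinx code; the class and function names are made up for illustration, and it assumes the response body can be consumed as streamed text chunks:

```python
# Rough sketch (assumption, not the current Sphinx implementation):
# collect every requested anchor in a single pass over the document and
# stop feeding the parser as soon as all of them have been seen.
from html.parser import HTMLParser


class MultiAnchorParser(HTMLParser):
    """Track which of the wanted anchors (id= or name= values) appear."""

    def __init__(self, wanted: set[str]) -> None:
        super().__init__()
        self.pending = set(wanted)
        self.found: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("id", "name") and value in self.pending:
                self.pending.discard(value)
                self.found.add(value)


def find_anchors(chunks, wanted: set[str]) -> set[str]:
    """Feed streamed HTML chunks until every wanted anchor is found."""
    parser = MultiAnchorParser(wanted)
    for chunk in chunks:
        parser.feed(chunk)
        if not parser.pending:  # early exit: all anchors located
            break
    return parser.found
```

Anchors still pending when the stream ends would be reported as broken, which mirrors how a single missing anchor is reported today.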
I hope to pick this up again in a few weeks' time but wanted to leave some notes for anyone learning about it and/or working on it.
There are a few more steps required before this can be closed, @AA-Turner: #12206 is one step closer, but doesn't yet resolve this.