
linkcheck performance: downloading page multiple times when checking anchors

Open rleigh-codelibre opened this issue 7 years ago • 3 comments

Problem

  • If my Sphinx documentation contains multiple anchored links to the same web page, the link checker downloads that page multiple times, once per anchor to check
  • This scales very badly. With many hundreds or thousands of anchors (e.g. in automatically generated documentation), the total download volume is roughly the page size multiplied by the number of links: a 5 MB page referenced by 500 anchored links means about 2.5 GB of redundant traffic

Procedure to reproduce the problem

  • create a document containing several links to different anchors on the same web page (for example, as in the sketch below)
  • run the link checker; it will fetch the page once per anchored link
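
As an illustration, a minimal reStructuredText source of this shape triggers the behaviour; the URL and anchor names here are placeholders, not taken from the report above:

    See `section one <https://example.com/page.html#section-1>`_
    and `section two <https://example.com/page.html#section-2>`_.

Running sphinx-build -b linkcheck . _build then downloads page.html twice, once per anchor.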

Expected results

  • I would suggest that the link checker cache the anchors on each web page, so that it downloads each page only once and checks each link only once. It could build a dictionary of pages to check, storing each page's anchors as a list or set within it; since we know up front which of our links have anchors, we can skip storing anchors for pages where none are needed. A sketch of this idea follows the list below.
  • There may be other, better ways of doing this; I'm not familiar with the internals of the link checker.
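
For concreteness, a minimal sketch of that caching idea in Python. Everything here is hypothetical illustration: AnchorCache and _AnchorCollector are invented names, and the use of requests stands in for whatever HTTP layer the linkcheck builder actually uses.

    from html.parser import HTMLParser

    import requests


    class _AnchorCollector(HTMLParser):
        """Collect every id= attribute and <a name=...> anchor in a page."""

        def __init__(self):
            super().__init__()
            self.anchors = set()

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if value and (name == "id" or (tag == "a" and name == "name")):
                    self.anchors.add(value)


    class AnchorCache:
        """Download each page at most once, remembering its anchors."""

        def __init__(self):
            self._pages = {}  # url -> set of anchor names on that page

        def anchors_for(self, url):
            if url not in self._pages:
                response = requests.get(url, timeout=30)
                response.raise_for_status()
                collector = _AnchorCollector()
                collector.feed(response.text)
                self._pages[url] = collector.anchors
            return self._pages[url]

        def check(self, url, anchor):
            """Return True if the page at `url` defines `anchor`."""
            return anchor in self.anchors_for(url)

With this, checking http://localhost/foo#a and http://localhost/foo#b costs one download of /foo instead of two.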

Reproducible project / your project

  • https://github.com/openmicroscopy/bioformats/tree/develop/docs/sphinx
  • contains lots of links to https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html

Environment info

  • OS: Any
  • Python version: Any
  • Sphinx version: Any

rleigh-codelibre avatar Dec 15 '17 09:12 rleigh-codelibre

cc @jayaddison

AA-Turner avatar Jan 15 '24 07:01 AA-Turner

I'm taking a break from development here for a little while, but I had begun investigating and had sketched an approximate design for resolving this:

  • We have a ConnectionMeasurement helper in our linkcheck tests that inspects the number of HTTP connections made while the unit tests run. Following test-driven development, we could lower the expected value there, and fixing this bug should then confirm the improvement (not the only test coverage required, but a useful confirmation).
  • The contains_anchor function within the linkcheck builder is central to fixing this. Ideally it should be adapted, changed, or inverted somehow so that we can check for multiple anchors within a document while only parsing the document once. Parsing should stop as soon as all anchors are found, the same way it currently stops when checking for a single anchor and finding it. A sketch of one way to do that follows this list.
  • There may need to be some refactoring or clever logic to handle the fact that N source hyperlinks distill down to fewer than N URIs to retrieve when some of them share anchor targets (http://localhost/foo#main-heading, http://localhost/foo and http://localhost/foo#second-heading: 3 hyperlinks, but only 1 URI to retrieve). Currently our data structures and the end condition for the check worker thread assume a 1:1 mapping between source hyperlinks and reported results.
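
To illustrate the contains_anchor inversion, a rough sketch assuming the same html.parser-based approach the builder uses today. AnchorsCheckParser and contains_anchors are hypothetical names for this note, not existing Sphinx API:

    from html.parser import HTMLParser


    class AnchorsCheckParser(HTMLParser):
        """Stop-early parser that looks for several anchors at once."""

        def __init__(self, wanted):
            super().__init__()
            self.pending = set(wanted)  # anchors not yet found
            self.found = set()          # anchors confirmed so far

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if value in self.pending and (
                    name == "id" or (tag == "a" and name == "name")
                ):
                    self.pending.discard(value)
                    self.found.add(value)


    def contains_anchors(chunks, anchors):
        """Feed HTML chunks until every anchor is found or input ends.

        `chunks` is any iterable of str (e.g. a streamed response body);
        returns a dict mapping each anchor to True/False.
        """
        parser = AnchorsCheckParser(anchors)
        for chunk in chunks:
            parser.feed(chunk)
            if not parser.pending:  # early exit, as in the single-anchor case
                break
        parser.close()
        return {anchor: anchor in parser.found for anchor in anchors}

For example:

    html = '<h1 id="main-heading">Hi</h1> <h2 id="second-heading">Two</h2>'
    contains_anchors([html], {"main-heading", "second-heading", "missing"})
    # -> {'main-heading': True, 'second-heading': True, 'missing': False}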

I hope to pick this up again in a few weeks' time but wanted to leave some notes for anyone learning about it and/or working on it.

jayaddison avatar Mar 25 '24 23:03 jayaddison

There are a few more steps required before this can be closed, @AA-Turner - #12206 is one step closer, but doesn't yet resolve this.

jayaddison avatar Apr 24 '24 18:04 jayaddison