sphinx fix #11446: linkcheck should split url only if # part of anchor

fixes #11446

as described in the issue, the link checker incorrectly processes URLs that have a # that is part of the directory path, rather than designating the anchor.

This pull request fixes this by splitting the URL using a regexp (.+/[^/]*)#([^/]*?/?$), which is a complicated way of saying we want to split the URL by # only if it is not followed by more forward slashes.

I've tested the implementation by hand on the URL https://microsoft.github.io/pyright/#/type-concepts-advanced as described in the issue and it works. So that nothing breaks in the future, I added a test that ensures that the following two URLs:

"https://www.example.com/path/to/resource#123"

"https://www.example.com/path/to/#/resource#123"

the anchor "123" is split off correctly. Although the test works, it's quite technical with the mockers that I used. If you guys have suggestions on how to improve, I'm all ears.

May 28 '23 22:05 rsokolewicz

I'm not a regular contributor here, so please take my feedback with a grain of salt. My concern about this change isn't related to the code, but to standards (in particular RFC3986).

I don't think that handling the URL fragments this way would match the RFC3986 specification, and I think (although could be wrong) that the rfc3986 Python library demonstrates that:

>>> import rfc3986
>>> rfc3986.urlparse("https://www.example.com/path/to/#/resource#123")
ParseResult(scheme='https', userinfo=None, host='www.example.com', port=None, path='/path/to/', query=None, fragment='/resource#123')

If we think that's a bug with rfc3986 perhaps we could report it there - my guess is that pyright is using some a nonstandard URL scheme here however.

Supporting that kind of thing is often fine short-term to solve an individual case, but over a longer duration of time it can lead to standards fragmentation and resulting maintenance burden, confusion and compatibility problems.

I'll try to think of ways that we could address this and also handle the use case -- and perhaps that'll involve speaking to the pyright documentation maintainers -- but I'd like to suggest proceeding cautiously with any code changes.

(again, my contributions and involvement here are fairly minimal, so feel free to ignore all of the above)

Jun 01 '23 17:06 jayaddison

@jayaddison pyright is not doing anything "wrong" per se. The use of /#/ is for SPA (single-page applications) that allows you to navigate between different pages without restarting the whole app and this only for the client framework router (for instance Angular's HashLocationStrategy vs PathLocationStrategy).

While I commented the PR without considering RFC3986, I agree that it might be better to enforce RFC3986 by default. We could, however, hard-detect whether /#/ is in the URI or not instead of using a regular expression, and then, split the path + query according to RFC.

Jun 02 '23 08:06 picnixz

Yep, I've seen a few single-page applications use similar techniques, and since they can inspect the current URL from within their JavaScript code, they can implement patterns like this.

We don't have a JavaScript runtime within Sphinx, though, and it wouldn't seem great to follow a path of trying to implement edge case handling for all possible application configurations (that's the path I think that Sphinx should avoid, while considering alternative routes to achieve everyone's desired goals. I have a vague memory that there have been attempts to specify this in the past to make fragment-URLs crawlable by search engines).

Jun 02 '23 09:06 jayaddison

What about hard-detecting /#/ then (instead of a regex as done by this PR?)

Jun 02 '23 09:06 picnixz

Using /#/ as a special case would introduce the same problem: it'd essentially define an ad-hoc approach for how to handle some subset of URLs. I wouldn't be surprised if soon after (or, equally problematically, months or years after), we'd encounter requests about sites that use /#/ with a different anchor scheme - we end up adding further specialisations and customisations, often with conflicting objectives.

Where possible I think it's better to determine the objective, discuss that widely, gather feedback on ideas and problems, perhaps prototype an approach, and codify a standard (and then vote and agree upon it). It may seem like the standardisation process can take a long time, but I think that in the long term it's a more respectful approach and saves time (because less work is wasted, and systems become interoperable).

Jun 02 '23 09:06 jayaddison

we end up adding further specialisations and customisations, often with conflicting objectives.

(or we end up coming back here and saying 'nope, sorry - we changed our minds as maintainers and we don't want to support /#/ this way any more, so we shouldn't have implemented this change initially. possibly OK, depending on the way it's communicated and how urgent the use case is, but not a great experience for anyone, I think)

Jun 02 '23 09:06 jayaddison

Fair enough. Interestingly,

https://microsoft.github.io/pyright/#type-concepts-advanced

points to the same URL as

https://microsoft.github.io/pyright/#/type-concepts-advanced,

Jun 02 '23 09:06 picnixz

Everything after the first fragment identifier (#) is local to the client (and in fact not seen by the server), is my approximate understanding - so yep, I think that makes sense.

Jun 02 '23 09:06 jayaddison

my approximate understanding

Don't worry you are correct. I only wanted to point this fact for (possible) future discussion.

Jun 02 '23 09:06 picnixz

This has conflicts (#11432).

The points raised in discussion are hard to resolve, as single-page applications don't behave 'normally', and having anchors on top of anchors is by necessity framework-dependent behaviour.

Perhaps not detecting any further / characters is a good-enough hueristic, though I'm unsure.

A

Jul 20 '23 20:07 AA-Turner

sphinx sphinx copied to clipboard

fix #11446: linkcheck should split url only if # part of anchor

sphinx
sphinx copied to clipboard