linkchecker Support directory URLs for the AnchorCheck plugin

Summary

The AnchorCheck plugin doesn't support URLs that don't point directly to a file, which is quite common. For example, by default MkDocs enables use_directory_urls to provide nicer looking URLs:

Source file	use_directory_urls: true	use_directory_urls: false
index.md	/	/index.html
api-guide.md	/api-guide/	/api-guide.html
about/license.md	/about/license/	/about/license.html

cc @cjmayo @pacenathan to confirm

Steps to reproduce

https://hatch.pypa.io/latest/install/
git clone https://github.com/pypa/hatch
cd hatch
In this file (or any other) append a foo to any anchor link outside this file
hatch run docs:build
hatch run docs:validate
Notice success
hatch run docs:build-check
Notice warning/error when running MkDocs with --no-directory-urls

Actual result

Success if unknown anchor

Expected result

Failure if unknown anchor

Environment

Operating system: Windows 10
Linkchecker version: d9265bb71c2054bf57b8c5734a4825d62505c779
Python version: 3.10.7
Install method: from git

Configuration file

[AnchorCheck]

Logs

linkchecker --config .linkcheckerrc site

Oct 16 '22 21:10 ofek

I did some experiments with some simple HTML files and AnchorCheck was able to detect whether something like: <a href="/test/#anchor"> existed or not.

Looks like at the original time of posting the above steps would not have run AnchorCheck because .linkcheckerrc wasn't committed until later: https://github.com/pypa/hatch/commit/55102457eb8c4d9128584d7197434b9e29ff64b2

Oct 17 '22 18:10 cjmayo

Or maybe not, commit timings are close but in sequence.

Oct 17 '22 18:10 cjmayo

I think he's saying that if the target link is foo/bar.html#broken then linkchecker will report that as broken, but if the target link looks like foo/bar/#broken then linkchecker will not give a warning, even if broken is missing from the page at that URL.

I tested, and linkchecker DOES report the problem if you run the test against a web server, but it does NOT report the problem if you run against the filesystem. So presumably there's some interaction between the index.html handling and the anchor checking, when processing HTML from the filesystem.

OTOH, when I ran linkchecker against the built (hatch) site running under a local web server, that build ALSO reported a full-on broken link from /users/ to /dev/users/. I verified in the browser and the filesystem that there really is a broken link there. That link wasn't reported from the file-based check, either.

Oct 17 '22 19:10 pacenathan

(I should have specified - I ran these tests both from this repo's linkchecker and from my fork, and they both showed the same behavior, so there's no fix already in the wings.)

Oct 17 '22 19:10 pacenathan

A few clues, from a --verbose run against the filesystem:

URL        `../config/build/#build-targets-foo'
Name       `target'
Parent URL file:///Users/truist/src/hatch/site/build/index.html, line 9, col 26382
Real URL   file:///Users/truist/src/hatch/site/config/build/#build-targets-foo
Check time 0.000 seconds
Result     Valid: directory

URL        `/dev/users/'
Name       `development version'
Parent URL file:///Users/truist/src/hatch/site/users/index.html, line 9, col 27015
Real URL   file:///dev/users/
Info       The URL is outside of the domain filter, checked only
           syntax.
Result     Valid: filtered

Oct 17 '22 19:10 pacenathan

I think he's saying that if the target link is foo/bar.html#broken then linkchecker will report that as broken, but if the target link looks like foo/bar/#broken then linkchecker will not give a warning, even if broken is missing from the page at that URL.

Yes exactly!

broken link from /users/ to /dev/users/

Ah, that's the versioning https://github.com/pypa/hatch/tree/gh-pages

Oct 17 '22 20:10 ofek

Oh, I see - you explicitly used an absolute URL there, which seems to result in linkchecker ignoring it. I didn't realize it did that. (Convenient!)

OK, so that's a red herring, and unrelated to this issue with the anchors.

Oct 17 '22 20:10 pacenathan

@cjmayo the issue with the anchors seems to be because of this line, which seems to treat all links to directories as valid, short-circuiting the check for anchors. I assume that's intentional, for sites that generate directory listings automatically?

So I'm not sure about the right fix for this for AnchorCheck. The obvious thing would just be to check inside the directory for an index.html, but that is based on a number of assumptions that might not be true. (But neither is it safe to assume that there will be a web server generating a directory listing.)

For my personal purposes (i.e. mkdocs), a fix that just assumes index.html would be sufficient. But that might not be general-purpose enough. Ideally there'd be some sort of configuration option to define this behavior.

Oct 17 '22 20:10 pacenathan

FYI I updated my fork just to show a warning in this case, because we don't need this case to work, because we don't use use_directory_urls (in mkdocs), because we need offline support.

@cjmayo this might be a good short-term fix to linkchecker in general, to at least declare that the check results are incomplete in this case, until there's a real fix for this issue.

I didn't make a PR because I haven't reconciled my branches with your latest changes, so I have no easy way to make a common base. You might be able to cherry-pick the change cleanly, though.

Oct 18 '22 14:10 pacenathan

Yes, I rushed in and only checked http, and didn't even make that clear...

Indeed it seems an 'index.html' configuration option would likely be needed, better than having an algorithm to manage. Plus then probably a different warning to indicate index.html couldn't be found, rather than just the anchor.

And then a bunch of AnchorCheck only code to implement this.

I tend towards reverting the file: support for now #681 rather than having something that partially works, FileUrl then goes back to handling file: schemes as per the spec, and we are still ahead in that AnchorCheck, as was, is now working. Then there is room for a solution to the plugin problem which doesn't need code in the core. Maybe even not trying to emulate an http server with files but actual run an http server.

Sphinx would be able to pickup internal link problems at build time. I haven't used mkdocs - there is a strict option - does that do it?

Oct 18 '22 18:10 cjmayo

Sphinx would be able to pickup internal link problems at build time. I haven't used mkdocs - there is a strict option - does that do it?

Yes

Oct 18 '22 18:10 ofek

Just my two cents - AnchorCheck works just fine with regular files, and that's actually my primary use-case for it. #681 would force me to look for an entirely different tool, if I didn't already have my own fork. I assume that would be the primary use case for many other people using mkdocs, too.

On mkdocs strict: mkdocs explicitly does not support link checking. See here for where they say that. They only check links for pages they think they need to generate, so if you typo a link with something like [some text](path/oops_I_forgot_the_md_file_extension) it will happily generate a page with a link to a non-existent page, and not warn you about it. That, plus the lack of anchor checking, plus the lack of external link checking, is what originally led me here.

Oct 18 '22 19:10 pacenathan

Works for me:

❯ hatch run docs:build
INFO     -  Cleaning site directory
INFO     -  Building documentation to directory: C:\Users\ofek\Desktop\code\hatch\site
WARNING  -  Documentation file 'build.md' contains a link to 'config/foo-build.md' which is not found in the documentation files.

Aborted with 1 warnings in strict mode!

but yes no anchor checking which is why I'm here.

Just my two cents - AnchorCheck works just fine with regular files, and that's actually my primary use-case for it. https://github.com/linkchecker/linkchecker/pull/681 would force me to look for an entirely different tool, if I didn't already have my own fork. I assume that would be the primary use case for many other people using mkdocs, too.

Same, please don't 🙏

Oct 18 '22 20:10 ofek

@ofek in your example, try changing foo-build.md to just foo-build, and build. You'll get a rendered HTML page with a link in it, but the link will 404.

Oct 18 '22 21:10 pacenathan

Ah, I see 👍

Oct 18 '22 21:10 ofek

I do understand it is useful (and quick) to be able to check files. In the background here is that I do want to make a new release soon. So even getting too diverted into thinking about this one wasn't what I was hoping for - but I am glad to have been let know.

I'll have a look at the warning patch.

Oct 19 '22 18:10 cjmayo

Rebasing the patch was straightforward (there is a convention for warning names that the first bit after the WARN_ is the checker, warnings are documented in linkcheckerrc(5) etc.).

But when I did a bit more testing without AnchorCheck I saw that we had changed the Real URL being reported for these kind of URLs to include the anchor, even though it is the directory that is being checked (and this is then used as the Parent URL for each of the contents of that directory).

New proposal, #683 to create an AnchorCheckFileUrl class - and just keep everything separate while there are these unresolved issues.

Oct 24 '22 18:10 cjmayo

I added a comment to one line of code in that PR, but I assume you're already aware of it. Otherwise this seems like a viable middle ground.

Oct 25 '22 13:10 pacenathan

Thanks.

Looking back over old issues #206 is also a request to look for index.html when checking local files.

Oct 25 '22 18:10 cjmayo

Has this been fixed?

Feb 28 '24 19:02 ofek

linkchecker linkchecker copied to clipboard

Support directory URLs for the AnchorCheck plugin

Summary

Steps to reproduce

Actual result

Expected result

Environment

Configuration file

Logs

linkchecker
linkchecker copied to clipboard