linkchecker
linkchecker copied to clipboard
Support directory URLs for the AnchorCheck plugin
Summary
The AnchorCheck plugin doesn't support URLs that don't point directly to a file, which is quite common. For example, by default MkDocs enables use_directory_urls
to provide nicer looking URLs:
Source file | use_directory_urls: true | use_directory_urls: false |
---|---|---|
index.md | / | /index.html |
api-guide.md | /api-guide/ | /api-guide.html |
about/license.md | /about/license/ | /about/license.html |
cc @cjmayo @pacenathan to confirm
Steps to reproduce
- https://hatch.pypa.io/latest/install/
-
git clone https://github.com/pypa/hatch
-
cd hatch
- In this file (or any other) append a
foo
to any anchor link outside this file -
hatch run docs:build
-
hatch run docs:validate
- Notice success
-
hatch run docs:build-check
- Notice warning/error when running MkDocs with
--no-directory-urls
Actual result
Success if unknown anchor
Expected result
Failure if unknown anchor
Environment
- Operating system: Windows 10
- Linkchecker version: d9265bb71c2054bf57b8c5734a4825d62505c779
- Python version: 3.10.7
- Install method: from git
Configuration file
[AnchorCheck]
Logs
linkchecker --config .linkcheckerrc site
I did some experiments with some simple HTML files and AnchorCheck was able to detect whether something like:
<a href="/test/#anchor">
existed or not.
Looks like at the original time of posting the above steps would not have run AnchorCheck because .linkcheckerrc
wasn't committed until later: https://github.com/pypa/hatch/commit/55102457eb8c4d9128584d7197434b9e29ff64b2
Or maybe not, commit timings are close but in sequence.
I think he's saying that if the target link is foo/bar.html#broken
then linkchecker will report that as broken, but if the target link looks like foo/bar/#broken
then linkchecker will not give a warning, even if broken
is missing from the page at that URL.
I tested, and linkchecker DOES report the problem if you run the test against a web server, but it does NOT report the problem if you run against the filesystem. So presumably there's some interaction between the index.html
handling and the anchor checking, when processing HTML from the filesystem.
OTOH, when I ran linkchecker against the built (hatch) site running under a local web server, that build ALSO reported a full-on broken link from /users/
to /dev/users/
. I verified in the browser and the filesystem that there really is a broken link there. That link wasn't reported from the file-based check, either.
(I should have specified - I ran these tests both from this repo's linkchecker and from my fork, and they both showed the same behavior, so there's no fix already in the wings.)
A few clues, from a --verbose
run against the filesystem:
URL `../config/build/#build-targets-foo'
Name `target'
Parent URL file:///Users/truist/src/hatch/site/build/index.html, line 9, col 26382
Real URL file:///Users/truist/src/hatch/site/config/build/#build-targets-foo
Check time 0.000 seconds
Result Valid: directory
URL `/dev/users/'
Name `development version'
Parent URL file:///Users/truist/src/hatch/site/users/index.html, line 9, col 27015
Real URL file:///dev/users/
Info The URL is outside of the domain filter, checked only
syntax.
Result Valid: filtered
I think he's saying that if the target link is
foo/bar.html#broken
then linkchecker will report that as broken, but if the target link looks likefoo/bar/#broken
then linkchecker will not give a warning, even ifbroken
is missing from the page at that URL.
Yes exactly!
broken link from
/users/
to/dev/users/
Ah, that's the versioning https://github.com/pypa/hatch/tree/gh-pages
Oh, I see - you explicitly used an absolute URL there, which seems to result in linkchecker ignoring it. I didn't realize it did that. (Convenient!)
OK, so that's a red herring, and unrelated to this issue with the anchors.
@cjmayo the issue with the anchors seems to be because of this line, which seems to treat all links to directories as valid, short-circuiting the check for anchors. I assume that's intentional, for sites that generate directory listings automatically?
So I'm not sure about the right fix for this for AnchorCheck. The obvious thing would just be to check inside the directory for an index.html
, but that is based on a number of assumptions that might not be true. (But neither is it safe to assume that there will be a web server generating a directory listing.)
For my personal purposes (i.e. mkdocs), a fix that just assumes index.html
would be sufficient. But that might not be general-purpose enough. Ideally there'd be some sort of configuration option to define this behavior.
FYI I updated my fork just to show a warning in this case, because we don't need this case to work, because we don't use use_directory_urls
(in mkdocs), because we need offline support.
@cjmayo this might be a good short-term fix to linkchecker in general, to at least declare that the check results are incomplete in this case, until there's a real fix for this issue.
I didn't make a PR because I haven't reconciled my branches with your latest changes, so I have no easy way to make a common base. You might be able to cherry-pick the change cleanly, though.
Yes, I rushed in and only checked http, and didn't even make that clear...
Indeed it seems an 'index.html' configuration option would likely be needed, better than having an algorithm to manage. Plus then probably a different warning to indicate index.html couldn't be found, rather than just the anchor.
And then a bunch of AnchorCheck only code to implement this.
I tend towards reverting the file: support for now #681 rather than having something that partially works, FileUrl then goes back to handling file: schemes as per the spec, and we are still ahead in that AnchorCheck, as was, is now working. Then there is room for a solution to the plugin problem which doesn't need code in the core. Maybe even not trying to emulate an http server with files but actual run an http server.
Sphinx would be able to pickup internal link problems at build time. I haven't used mkdocs - there is a strict option - does that do it?
Sphinx would be able to pickup internal link problems at build time. I haven't used mkdocs - there is a strict option - does that do it?
Yes
Just my two cents - AnchorCheck works just fine with regular files, and that's actually my primary use-case for it. #681 would force me to look for an entirely different tool, if I didn't already have my own fork. I assume that would be the primary use case for many other people using mkdocs, too.
On mkdocs strict: mkdocs explicitly does not support link checking. See here for where they say that. They only check links for pages they think they need to generate, so if you typo a link with something like [some text](path/oops_I_forgot_the_md_file_extension)
it will happily generate a page with a link to a non-existent page, and not warn you about it. That, plus the lack of anchor checking, plus the lack of external link checking, is what originally led me here.
Works for me:
❯ hatch run docs:build
INFO - Cleaning site directory
INFO - Building documentation to directory: C:\Users\ofek\Desktop\code\hatch\site
WARNING - Documentation file 'build.md' contains a link to 'config/foo-build.md' which is not found in the documentation files.
Aborted with 1 warnings in strict mode!
but yes no anchor checking which is why I'm here.
Just my two cents - AnchorCheck works just fine with regular files, and that's actually my primary use-case for it. https://github.com/linkchecker/linkchecker/pull/681 would force me to look for an entirely different tool, if I didn't already have my own fork. I assume that would be the primary use case for many other people using mkdocs, too.
Same, please don't 🙏
@ofek in your example, try changing foo-build.md
to just foo-build
, and build. You'll get a rendered HTML page with a link in it, but the link will 404.
Ah, I see 👍
I do understand it is useful (and quick) to be able to check files. In the background here is that I do want to make a new release soon. So even getting too diverted into thinking about this one wasn't what I was hoping for - but I am glad to have been let know.
I'll have a look at the warning patch.
Rebasing the patch was straightforward (there is a convention for warning names that the first bit after the WARN_ is the checker, warnings are documented in linkcheckerrc(5) etc.).
But when I did a bit more testing without AnchorCheck I saw that we had changed the Real URL being reported for these kind of URLs to include the anchor, even though it is the directory that is being checked (and this is then used as the Parent URL for each of the contents of that directory).
New proposal, #683 to create an AnchorCheckFileUrl class - and just keep everything separate while there are these unresolved issues.
I added a comment to one line of code in that PR, but I assume you're already aware of it. Otherwise this seems like a viable middle ground.
Thanks.
Looking back over old issues #206 is also a request to look for index.html when checking local files.
Has this been fixed?