Exclude rel="canonical" links from checking
Lychee currently checks links in <link rel="canonical"> elements. This causes problems when I use lychee in a pre-deployment check and a new page is about to be added to the site, because the page does not yet exist at its canonical URL. I would also argue that canonical links are metadata about the current page, not links that should be validated. The links in question occur in the HTML head and look like this:
<link rel="canonical" href="https://www.example.com/some/page.html"/>
Suggestion: Links with rel="canonical" should be excluded from checking, similar to how rel="nofollow", rel="preconnect", and rel="dns-prefetch" are already excluded.
If never following rel="canonical" links is too intrusive, maybe following the links could at least be implemented while lychee is in directory traversal mode, if some command line option is set?
I think it makes more sense to check the links while redirecting links to the production website to local files. This is possible by a certain combination of lychee flags and features. If this would work for your case, I can write a docs page on how to do so - I have been meaning to write about this.
Similar to https://github.com/lycheeverse/lychee/issues/1594?
Thank you for your prompt answer!
Yes, I think my report is a duplicate of #1594, sorry for not spotting this.
And yes, if I could say that files in some/folder/somewhere/ will be available later at http://some.host.com/ this would solve my problem. What I mean is: to check the link http://some.host.com/a/b/c.html, it would be enough to make sure that some/folder/somewhere/a/b/c.html exists. Links to different hosts should still be checked using http. Maybe http://some.host.com/a/b/ would need to check whether some/folder/somewhere/a/b/index.html exists?
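To make the idea concrete, here is a rough sketch of the mapping I have in mind. It is not how lychee works internally. The function names url_to_local_path and local_link_ok are hypothetical, and SITE_PREFIX and LOCAL_ROOT are just the placeholder host and folder from my example:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of lychee: map a URL on the future
# production host to the local file that will be deployed there.
# Both values below are placeholders taken from the example above.
SITE_PREFIX="http://some.host.com/"
LOCAL_ROOT="some/folder/somewhere"

url_to_local_path() {
  local url="$1"
  case "$url" in
    "$SITE_PREFIX"*) ;;   # URL is on our own host: map it below
    *) return 1 ;;        # any other host: caller should check it over HTTP
  esac
  local rel="${url#"$SITE_PREFIX"}"   # strip the host prefix
  rel="${rel%%[?#]*}"                 # drop query string and fragment
  case "$rel" in
    ""|*/) rel="${rel}index.html" ;;  # directory URL: look for index.html
  esac
  printf '%s/%s\n' "$LOCAL_ROOT" "$rel"
}

# A link to our own host counts as valid if the mapped file exists locally.
local_link_ok() {
  local path
  path="$(url_to_local_path "$1")" || return 2   # 2 = not a local link
  [ -f "$path" ]
}
```

With this, url_to_local_path "http://some.host.com/a/b/" prints some/folder/somewhere/a/b/index.html, and a URL on any other host returns a non-zero status so that it can still be checked over HTTP.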
Lovely! To make that work, you need to use something like
lychee some/folder --root-dir "$(pwd)/some/folder" --index-files index.html --remap "https?://some\.host\.com file://$(pwd)/some/folder"
Let me know if this command works for you!
The remap feature is documented briefly on the website but the page is a bit too general. I'd like to write up a page about this specific use case, because I think it's fairly common and has some nuances.
True, canonical links are quite important and should be checked in general. The reason is that they point to the "true origin" of a resource, and if they break, that's a major issue with the content. In the worst case, it would mean that search engines can no longer properly index the page. So fixing that issue on the user side, as @katrinafyi showed, is the way to go.
@mre I agree, but of course for pre-deployment checks of new content, the page will not yet be present at the canonical location at the time of the check. I haven't had time to try this yet, but I believe the remap feature will solve the problem.
Absolutely. We should document that workaround. 👍