Add support for anchor tags
Links that point to non-existent anchors, e.g. https://github.com/lenisha/linkchecker-test/blob/main/about.html#L92, are not detected:
`<a href="https://github.com/Microsoft/onnxruntime#api-documentation" target="_blank" class="link">`
At this point, they're not supported.
This is a bit of a tricky one. Anchors are really properties of the document contents, not integral parts of the URLs. What I mean by that is this:
- You can't tell if an anchor is valid by merely making a request (you can only check the URL path without the anchor).
- To verify anchors, you'd have to parse the retrieved HTML and check that they're there (but we already do that for link extraction using html5ever, a browser-grade parser).
So this might not be very straightforward, but should be perfectly doable. I'd welcome such a feature :smiley:
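The HTML-side check could be sketched as follows. This uses a naive substring scan purely for illustration; a real implementation would walk the DOM produced by html5ever instead, as lychee already does for link extraction:

```rust
/// Naive illustration only: check whether an HTML document contains an
/// element whose `id` attribute equals the URL fragment. A real
/// implementation would use a proper HTML parser such as html5ever.
fn has_anchor(html: &str, fragment: &str) -> bool {
    let double_quoted = format!("id=\"{fragment}\"");
    let single_quoted = format!("id='{fragment}'");
    html.contains(&double_quoted) || html.contains(&single_quoted)
}
```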
> To verify anchors, you'd have to parse the retrieved HTML and check they're there (but we already do it for link extraction using html5ever, a browser grade parser)
HTML is parsed for link extraction on the input documents, but not for the target documents. So it would require first checking whether the target is actually an HTML document, then downloading it fully (currently it is only partially downloaded, or not at all when using -X head), and then parsing the target HTML for a matching `id` attribute.
When support for internal links via file existence check is added (#21), then it could be at least cheaply added for those:
- If the link/URL is an internal one, check only whether the target file exists, relative to the input HTML file's location, or relative to the base path when one is given and the link is an absolute internal one.
- Then, if the target is an HTML file, or a directory that contains an index.htm/index.html, check it for a matching `id` attribute.
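The file-resolution half of those steps could look roughly like this (`resolve_target` is a hypothetical helper sketched for illustration, not lychee API):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve an internal link to the file that would
/// have to contain the matching `id` attribute. Directories fall back to
/// index.html/index.htm, as described above.
fn resolve_target(base: &Path, link_path: &str) -> Option<PathBuf> {
    let target = base.join(link_path);
    if target.is_dir() {
        ["index.html", "index.htm"]
            .iter()
            .map(|index| target.join(index))
            .find(|candidate| candidate.is_file())
    } else if target.is_file() {
        Some(target)
    } else {
        None
    }
}
```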
In other cases, I think it's difficult if one does not want to incur much overhead for checking the URL's document type and downloading it completely. But nevertheless, the idea/feature itself would be awesome of course 🙂.
For reference the hint by @polarathene: https://github.com/lycheeverse/lychee/issues/21#issuecomment-859969005
Another case where the difficulties with target downloads etc. do not apply (and probably this is what you had in mind anyway) is checking scroll links within the input HTMLs: if there is an `<a href="#heading">`, check whether there is an `id="heading"` as well. If such is wanted, it's good to create a generic method/function for it, so that URL fragments can be generally detected and HTMLs generally scanned for matching IDs, but at first implemented for such same-page scroll links, then, depending on #21, for locally stored target HTML files, and gradually for online HTML documents, if this is found to be important enough to fully download the target, based on content-type: text/html or so. But I'm not sure if the content type header is reliable enough, or if there is a risk that a large archive or video file is unintentionally downloaded 😄.
> But I'm not sure if the content type header is reliable enough of if there is a risk that a large archive or video file is unintentionally downloaded
Response headers should provide info about content size; completing a request in full (e.g. downloading a large resource) shouldn't be necessary and could be terminated early, AFAIK?
> completing a request in full (eg downloading a large resource) shouldn't be necessary and could be terminated early AFAIK?
If you want to check whether a URL fragment is valid or not, you need to download the whole document, of course, which is the whole issue I see here, as long as it's not about same-page links or local files. Currently the download is terminated early, reasonably, so it is currently impossible to apply this feature to remote HTML documents without changing the download behaviour based on things like HTTP headers, which is why I think one needs to evaluate carefully whether the effort is worth it or not 😉.
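A header-based decision could be sketched as simply as this (both the helper and the 5 MiB cap are assumptions for illustration, not existing lychee behaviour):

```rust
/// Sketch: decide from the response headers whether fully downloading the
/// body for fragment checking seems reasonable. Only proceed for HTML
/// documents with a known, modest size.
fn should_download(content_type: Option<&str>, content_length: Option<u64>) -> bool {
    const MAX_BYTES: u64 = 5 * 1024 * 1024; // arbitrary cap, an assumption
    let is_html = content_type.map_or(false, |ct| ct.starts_with("text/html"));
    let small_enough = content_length.map_or(false, |len| len <= MAX_BYTES);
    is_html && small_enough
}
```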
> you need to download the whole document of course, which is the whole issue I see here
Downloading a complete HTML document to operate on isn't much of a concern; larger files are what I thought was being raised as the concern, and that could be avoided.
You make a good point though, especially with HTML documents that depend on JS to populate the real content, such as React pages without any pre-rendering process involved. That'd require actually loading and running the JS…
I haven't used lychee myself yet, so I'm not sure if it supports that well (anchor tag feature aside).
If other attributes of the `<a>` tag are provided, you can use them to help determine whether to download that webpage, whether it links to a webpage, etc. But it gets complicated.
https://html.spec.whatwg.org/multipage/links.html#links
Anchor tags are supported by https://github.com/killercup/link-checker and the implementation is done with a HashSet of anchor links per page. https://github.com/killercup/link-checker/blob/master/src/links.rs. Something to look into if we get around to adding it to lychee.
Sometimes anchors aren't even in the HTML itself, like for static websites! Another approach could be to make the request with and without the anchor and compare the scroll location, if that's possible; it means making two requests though.
> make the request with and without the anchor and compare the scroll location if that's possible
That requires emulation of webpage rendering and is outside the scope of this project.
Also, dynamic websites still send an HTML document with element IDs; or do you mean scrolling via JavaScript? Usually JavaScript is just used for better control of smooth scrolling, in addition to the still-present functional element IDs for clients with no JavaScript support.
If there is really no matching ID, then I agree that the required full browser features, including a JavaScript engine etc., are out of scope. Also, I can't imagine that anyone wants to wait 5-10 seconds, twice, for every checked link, as required to load all assets, render the page, execute JavaScript, finish all events etc. Furthermore, since many websites have no definite "all done" state, but load/show additional dynamic content (ads, cookie banners, notification banners, ...) random seconds after the relevant content has already been rendered, including content shift and a changing scroll position, it is practically impossible to do such a comparison reliably without a lot of false negatives and without giving a very long grace period (30 seconds?) to minimise the risk of still-expanding content.
It might be worth splitting up the scope here. For my use case (a technical specification), I have a bunch of local .md files. I'd like to see that the links within those files are valid: they go to a valid file (already checked), and to a valid target anchor among the spec files. Also, I'd like to see that the http/s URLs at least go to a valid place. As noted, it could be challenging to ensure an anchor is valid on a dynamic site. I'm less interested in whether the external HTML files have the right anchor; those are likely to move around, and not everyone tries to keep stable links.
Perhaps a way forward would be to have some way to determine which filetypes and schemes are subject to anchor checking.
In my case: filetype *.md and scheme file: only.
Other users, such as the original author, may make it broader, e.g. including http: and https:.
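Such a per-user policy could be expressed as a simple predicate (hypothetical names, just to illustrate the filetype/scheme filtering idea):

```rust
/// Hypothetical predicate: only check fragments for scheme/extension
/// combinations the user opted into, here `file:` links to `.md` files.
fn should_check_fragment(scheme: &str, path: &str) -> bool {
    scheme == "file" && path.to_ascii_lowercase().ends_with(".md")
}
```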
I've created https://github.com/lycheeverse/lychee/pull/1126 with an implementation of anchor/fragment checking for links to markdown files. Do comment with your thoughts. Is it roughly what you would like?
Yes, it seems solving this as a whole isn't so simple, but this solves a part of it.
How does it determine if a link follows the md format? Checking the PR, it seems to check if the link ends with .md.
Here is a rather verbose message with my thoughts.
For fragment checking, we are only concerned with where links are pointing to and there are five forms that come to mind:
1. `file:///foo/bar.md#frag`
2. `file:///foo/bar.html#frag`
3. `https://foo/bar.html#frag`
4. `https://foo/bar#frag`
5. `https://foo/bar.md#frag`
There are two file types, markdown and HTML, that people sound interested in checking fragments within. It would make sense to use the extension in the link to determine the file type, i.e. .md and .html. The exception is Case 4, where one could assume it is an HTML file because it is being accessed over HTTP.
These files can be accessed through the file system (file:) or HTTP (http:/https:).
I would like lychee to at least support Cases 1, 2 and 3, with separate options for markdown and HTML fragment checking. I don't have a strong opinion about Cases 4 and 5 at the moment.
I think HTML fragment checks should be opt-in, e.g. --include-html-fragments, due to false negatives on dynamic sites. I feel markdown fragment checks should be opt-out, e.g. --ignore-md-fragments.
Another decision that needs to be made is how we generate fragments from a markdown file. We could choose one method that we feel is the normal method or we could offer multiple methods. If we are offering multiple methods, we would have to make sure the configuration isn't overwhelming.
The most common explicit generation method appears to be heading attributes. These aren't implemented in the CommonMark spec, but are supported by pulldown-cmark and are used commonly, in python-markdown for example.
# Heading Attributes
## A Title {#a_title}
The above heading will be given `a_title` as an id.
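Extracting such an explicit id could be sketched as follows (assuming the trailing `{#...}` syntax shown above; this is not pulldown-cmark's actual API):

```rust
/// Sketch: pull an explicit id out of a trailing `{#...}` heading attribute.
/// Returns None when the heading has no such attribute.
fn heading_attribute_id(heading: &str) -> Option<&str> {
    let start = heading.rfind("{#")?;
    let rest = &heading[start + 2..];
    let end = rest.find('}')?;
    Some(&rest[..end])
}
```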
The most common automatic generation method seems to be what I call unique kebab case, where headings are kebab-cased (lower-cased, with whitespace replaced by -). When two headings produce the same kebab-case fragment, a number is appended to make it unique. This is used by GitHub, GitLab and mdBook.
# Unique Kebab Case
## The Title
The above heading will have `the-title` as an id.
## A Title
The above heading will have `a-title` as an id.
## A Title
The above heading will have `a-title-1` as an id.
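This generation scheme can be sketched roughly as follows, assuming GitHub-like rules (keep alphanumerics, `-` and `_`; map whitespace to `-`; drop everything else; suffix repeats with a counter). The exact rules are an assumption; forges differ, as discussed later. Input is the heading text without the leading `#` markers:

```rust
use std::collections::HashMap;

/// Sketch of "unique kebab case" fragment generation under the assumed
/// rules above; not lychee's or GitHub's actual implementation.
fn kebab_fragment(heading: &str, seen: &mut HashMap<String, usize>) -> String {
    let base: String = heading
        .trim()
        .chars()
        .filter_map(|c| {
            if c.is_alphanumeric() || c == '-' || c == '_' {
                Some(c.to_ascii_lowercase())
            } else if c.is_whitespace() {
                Some('-') // every whitespace character becomes a dash
            } else {
                None // punctuation is dropped entirely
            }
        })
        .collect();
    // Append a counter to de-duplicate repeated headings.
    let count = seen.entry(base.clone()).or_insert(0);
    let fragment = if *count == 0 {
        base.clone()
    } else {
        format!("{base}-{count}")
    };
    *count += 1;
    fragment
}
```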
If we want to provide a simple interface with only one generation approach, I think we should either use heading attributes and fall back to unique kebab case when not present, or use both unique kebab case and heading attributes (when present).
Note selectors spec:
> If an element has multiple ID attributes, all of them must be treated as IDs for that element for the purposes of the ID selector.
Summary of the Open Questions
- Which of the 5 link types we should support fragment checks for?
- Whether they should be opt-in or opt-out?
- Which method(s) of generating fragments from markdown should we implement? Should we support multiple? If so, how do we expose these options to the user?
What has been done so far?
In https://github.com/lycheeverse/lychee/pull/1126, I have just implemented checks for Case 1 links, i.e. file:///foo/bar.md#frag links pointing to files with .md extensions.
It is missing tests and is not ready to be merged. It currently offers multiple different fragment generators and doesn't support inline HTML within the markdown file.
Next Steps
To get this first PR in we need to answer questions 2 and 3. After this has been merged, I can start working on getting the additional link types (that we later agree on) working.
> how does it determine if a link follows the md format? checking the pr it seems to check if the link ends with .md
Yep, it just looks at the extension.
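In code, that check might look like the following sketch (I haven't verified this matches the PR's exact logic):

```rust
use std::path::Path;

/// Sketch: treat a link as markdown when its path component ends in `.md`,
/// ignoring any `#fragment` suffix.
fn is_markdown_link(url: &str) -> bool {
    let path = url.split('#').next().unwrap_or(url);
    Path::new(path)
        .extension()
        .map_or(false, |ext| ext.eq_ignore_ascii_case("md"))
}
```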
Thanks for the summary.
From my perspective, the answers to your open questions would be as follows:
> Which of the 5 link types we should support fragment checks for?
I agree that we should tackle 1-3. The other two cases seem to be outliers, which we could tackle later if needed.
> Whether they should be opt-in or opt-out?
If it turns out that fragment checking is robust enough, I would not mind enabling it by default. By "robust enough" I mean that the number of false positives (runs reporting a broken fragment which in fact works) is below a threshold of, say, 1%. I would not mind false negatives too much. So I'd say let's stay conservative and not report an error if we're uncertain. As for the flags, I'd vote for --include-fragments or --exclude-fragments, as this is the naming scheme for other inclusions/exclusions right now and follows the principle of least surprise. Whether we want to differentiate between e.g. --include-md-fragments and --include-html-fragments can be answered later. Ideally, I'd prefer a single option, but if one check is much more robust than the other, it might merit a separate flag.
> Which method(s) of generating fragments from markdown should we implement? Should we support multiple? If so, how do we expose these options to the user?
I'd keep it simple for now. We can always add another option later to specify the detection algorithm, e.g. --fragment-generation kebab-case.
> If we want to provide a simple interface with only one generation approach, I think we should either use heading attributes and fall back to unique kebab case when not present, or use both unique kebab case and heading attributes (when present).
Yes, fully agree. A combination of heading attributes and unique kebab case would work. We can use the examples from the Gitlab docs for our tests.
I think checking HTML files should be a higher priority, because the majority of the time it'll be Case 4, though I'm not sure how we'd do this; we'd probably need to parse the HTML and check for `<hn>` tags.
I agree this should be opt-in, because many websites don't support anchors, or they use anchors differently.
Another, possibly less efficient, approach would be to compare the response with and without the anchor to see if they're the same; this'd definitely be opt-in.
This isn't enough for websites/browsers that use anchors to scroll, though, which many of them do, so I think that'd require users to provide a browser API, and then we'd check whether it scrolled with the anchor or not.
> i think checking html files should be higher priority because majority of the time it'll be case 4
Yes, websites without a file extension are indeed a common use-case.
> though im not sure how we'd do this, we'd probably need to parse the html and check for `<hn>` tags
We already parse HTML, and we have access to all tags thanks to html5ever and html5gum, so fetching the `<hn>` tags won't be an issue.
> i agree this should be opt-in because many websites dont support anchors, or they use anchors differently
AFAIK, anchors are mainly a browser feature. They are not even sent to the website and are only handled client-side. You can confirm that by sending a request with an anchor to any website; it will be stripped out before the request.
Compare the URL in the address bar with the URL in the network tab.
JavaScript might also handle anchors, as it has access to the address bar. This is mainly used for dynamic websites, I guess.
No browser API is needed, as it's enough to know that the anchor exists on the website that the browser tries to render. We don't have to emulate the actual rendering. Just the anchor id generation.
There's a chance I misunderstood your point.
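For illustration, stripping the fragment before issuing the request is essentially a one-liner (a sketch, not lychee's actual code):

```rust
/// Sketch: drop the client-side `#fragment` part of a URL, mirroring what
/// browsers do before sending the request.
fn strip_fragment(url: &str) -> &str {
    url.split('#').next().unwrap_or(url)
}
```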
I meant that, mainly for SPAs or SSR, some websites use anchors kind of like paths or queries; for example, example.com#anchor is like example.com/anchor. I think this is done to keep the state of which "page" the user is on, so they can show it on reload, or when going back, etc.
These sites are a minority though, so we could skip over this if it's too much work (it probably is).
Ah, you got it. Thanks for the explanation. Yes, SPAs will be tricky for now. We might not be able to check these. Let's start with the simple use-case of static / server side rendered pages, which will already be a great step forward.
I was wrong about GitLab and GitHub using the same fragment generation. I've created some test cases to compare common forges:
| Heading | Forgejo | GitHub | GitLab |
|---|---|---|---|
| `# Repeated Heading` | repeated-heading | repeated-heading | repeated-heading |
| `## This header has a :thumbsup: in it` | this-header-has-a-thumbsup-in-it | this-header-has-a-thumbsup-in-it | this-header-has-a-thumbsup-in-it |
| `# header with 한글 characters (using unicode)` | header-with-한글-characters-using-unicode | header-with-한글-characters-using-unicode | header-with-한글-characters-using-unicode |
| `### Repeated Heading` | repeated-heading-1 | repeated-heading-1 | repeated-heading-1 |
| `## Repeated Heading` | repeated-heading-2 | repeated-heading-2 | repeated-heading-2 |
| `## Underscores foo_bar_, dots . and numbers 1.7e-3` | underscores-foo_bar_-dots-and-numbers-1-7e-3 | underscores-foo_bar_-dots--and-numbers-17e-3 | underscores-foo_bar_-dots-and-numbers-17e-3 |
| `## Many spaces` | many-spaces | many----------spaces | many-spaces |
NB: GitLab doesn't appear to follow its own GitLab-flavoured markdown documentation. Specifically, it doesn't actually collapse shortcodes, and 1.7 doesn't become 1-7 but 17.
The major difference appears to be that GitLab and Forgejo both collapse all non-alphanumeric characters into a single dash, whereas GitHub converts whitespace characters to dashes and ignores all others. However, GitLab appears to ignore the . in 1.7e-3, like GitHub, whereas Forgejo doesn't.
mdBook seems to follow GitHub's fragment generation. I suggest we do the same, as GitHub is the most popular forge. The option remains to add additional generators in the config in the future.
Perfect. Great analysis! Maybe you want to start a new crate (anchorfy?) which allows you to create the fragment from different flavors?
`assert_eq!(anchorfy::Github("## Many spaces"), "many----------spaces".to_string());`
`assert_eq!(anchorfy::Gitlab("## Many spaces"), "many-spaces".to_string());`
For now, GitHub/mdbook would be enough.
How would we handle these outside of the popular sites? Should we provide a way for the user to configure this (probably the same way we pre-configure these websites)?
I'd assume >90% of all Markdown files we check are on GitHub. If we make it configurable, we should keep it simple. Maybe we just introduce a --fragment-format option at some point, which accepts github, gitlab, etc. as an argument. For the initial version, let's go with GitHub and wait for feature requests. Otherwise, the PR becomes too big. What do you think?
That's fair enough, but another major use case would be linking to documentation, for example Swagger.
I think we should design the header format in an independent way, so that in the future we can let others set it more easily.
Sorry, I don't quite get it. If we find a fragment link in a document, we have to check if that fragment exists. Say we find "#helloworld" and there is a headline named "Hello World", for which a fragment was generated automatically by the hoster (GitHub, GitLab, Swagger (?)). Now the question is if that headline gets converted into "#helloworld" or "#hello-world" or something else. How can we design the header format in such a case? Don't we have to generate the same fragment link as the hoster?
It's likely I might be missing something here.
> Don't we have to generate the same fragment link as the hoster
Yes, that's what I mean. The exact implementation would need a balance between verbosity and conciseness, but for example something like
`[ { " ": "-" } ]`
would replace `" "` (a space) in the content with `-` in the header, transforming hello world into hello-world. We'd likely need more configuration, like escaping, casing, handling duplicates and so on.
Ah, I understand now. Not sure how to find the right balance between simplicity and flexibility here. Open for suggestions for the configuration format.
We could use TOML; I believe it supports arbitrary keys as strings ("abc" = "xyz"), which could be used for replacing characters.
We could take an "enum" (not sure TOML supports them) for casing; options I can think of are camel case, snake case, kebab case, pascal case and macro case.
For escaping, I think we already do percent encoding, so I'm not sure if there's additional configuration necessary.
Lastly, duplicates are a little tricky. We could have users provide a format template such as {anchor}-{n}, where n is the index of the occurrence. We'd also have to let them set whether it's 0-indexed or 1-indexed, and whether the first occurrence still needs this formatting (whether to use {anchor} or {anchor}-{0} for the first occurrence).
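Putting those ideas together, a hypothetical TOML shape could look like this (every key here is invented for illustration; none of this is an existing lychee option):

```toml
# Hypothetical fragment-generation config; all keys are invented.
[fragment]
casing = "kebab-case"         # e.g. camel, snake, kebab, pascal, macro
duplicates = "{anchor}-{n}"   # template for repeated anchors
duplicate-index-base = 1      # whether occurrences are 0- or 1-indexed

[fragment.replace]
" " = "-"                     # replace spaces with dashes
```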
Or we could take an entirely different approach and direct users to use the crate for configuration, maybe having them pass a closure like |element| -> anchor, which is infinitely extensible, but may be a dealbreaker for non-Rust users or people that don't want to spend the time on that.