Add support for anchor tags
Links that point to non-existent anchors, e.g. https://github.com/lenisha/linkchecker-test/blob/main/about.html#L92, are not detected:
`<a href="https://github.com/Microsoft/onnxruntime#api-documentation" target="_blank" class="link">`
At this point, they're not supported.
This is a bit of a tricky one. Anchors are really properties of the document contents, not integral parts of the URLs. What I mean by that is this:
- You can't tell if an anchor is valid by merely making a request (you can only check the URL path without the anchor).
- To verify anchors, you'd have to parse the retrieved HTML and check that they're there (but we already do that for link extraction using html5ever, a browser-grade parser).
So this might not be very straightforward, but should be perfectly doable. I'd welcome such a feature :smiley:
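The HTML-side check could be sketched as follows. This uses a naive substring scan purely for illustration; a real implementation would walk the DOM produced by html5ever instead, as lychee already does for link extraction:

```rust
/// Naive illustration only: check whether an HTML document contains an
/// element whose `id` attribute equals the URL fragment. A real
/// implementation would use a proper HTML parser such as html5ever.
fn has_anchor(html: &str, fragment: &str) -> bool {
    let double_quoted = format!("id=\"{fragment}\"");
    let single_quoted = format!("id='{fragment}'");
    html.contains(&double_quoted) || html.contains(&single_quoted)
}
```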
> To verify anchors, you'd have to parse the retrieved HTML and check they're there (but we already do it for link extraction using html5ever, a browser grade parser)
HTML is parsed for link extraction on the input documents, but not for the target documents. So it would require first checking whether the target is actually an HTML document, then downloading it fully (currently it is only partially downloaded, or not at all when using -X head), and then parsing the target HTML for a matching `id` attribute.
When support for internal links via file existence check is added (#21), then it could be at least cheaply added for those:
- If the link/URL is an internal one, check only whether the target file exists, relative to the input HTML file's location, or relative to the base path when one is given and the link is an absolute internal one.
- Then, if the target is an HTML file, or a directory that contains an index.htm/index.html, check it for a matching `id` attribute.
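The file-resolution half of those steps could look roughly like this (`resolve_target` is a hypothetical helper sketched for illustration, not lychee API):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve an internal link to the file that would
/// have to contain the matching `id` attribute. Directories fall back to
/// index.html/index.htm, as described above.
fn resolve_target(base: &Path, link_path: &str) -> Option<PathBuf> {
    let target = base.join(link_path);
    if target.is_dir() {
        ["index.html", "index.htm"]
            .iter()
            .map(|index| target.join(index))
            .find(|candidate| candidate.is_file())
    } else if target.is_file() {
        Some(target)
    } else {
        None
    }
}
```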
In other cases, I think it's difficult if one does not want to incur much overhead for checking the URL's document type and downloading it completely. But nevertheless, the idea/feature itself would be awesome of course 🙂.
For reference the hint by @polarathene: https://github.com/lycheeverse/lychee/issues/21#issuecomment-859969005
Another case where the difficulties with target downloads etc. do not apply (and probably this is what you had in mind anyway) is checking scroll links within the input HTMLs: if there is an `<a href="#heading">`, check whether there is an `id="heading"` as well. If such is wanted, it's good to create a generic method/function for it, so that URL fragments can be generally detected and HTMLs generally scanned for matching IDs, but at first implemented for such same-page scroll links, then, depending on #21, for locally stored target HTML files, and gradually for online HTML documents, if this is found to be important enough to fully download the target, based on content-type: text/html or so. But I'm not sure if the content type header is reliable enough, or if there is a risk that a large archive or video file is unintentionally downloaded 😄.
> But I'm not sure if the content type header is reliable enough of if there is a risk that a large archive or video file is unintentionally downloaded
Response headers should provide info about content size; completing a request in full (e.g. downloading a large resource) shouldn't be necessary and could be terminated early, AFAIK?
> completing a request in full (eg downloading a large resource) shouldn't be necessary and could be terminated early AFAIK?
If you want to check whether a URL fragment is valid or not, you need to download the whole document, of course, which is the whole issue I see here, as long as it's not about same-page links or local files. Currently the download is terminated early, reasonably, so it is currently impossible to apply this feature to remote HTML documents without changing the download behaviour based on things like HTTP headers, which is why I think one needs to evaluate carefully whether the effort is worth it or not 😉.
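A header-based decision could be sketched as simply as this (both the helper and the 5 MiB cap are assumptions for illustration, not existing lychee behaviour):

```rust
/// Sketch: decide from the response headers whether fully downloading the
/// body for fragment checking seems reasonable. Only proceed for HTML
/// documents with a known, modest size.
fn should_download(content_type: Option<&str>, content_length: Option<u64>) -> bool {
    const MAX_BYTES: u64 = 5 * 1024 * 1024; // arbitrary cap, an assumption
    let is_html = content_type.map_or(false, |ct| ct.starts_with("text/html"));
    let small_enough = content_length.map_or(false, |len| len <= MAX_BYTES);
    is_html && small_enough
}
```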
> you need to download the whole document of course, which is the whole issue I see here
Downloading a complete HTML document to operate on isn't much of a concern; larger files are what I thought was being raised as the concern, and that could be avoided.
You make a good point though, especially with HTML documents that depend on JS to populate the real content, such as React pages without any pre-rendering process involved. That'd require actually loading and running the JS…
I haven't used lychee myself yet, so I'm not sure if it supports that well (anchor tag feature aside).
If other attributes of the `<a>` tag are provided, you can use them to help determine whether to download that webpage, whether it links to a webpage, etc. But it gets complicated.
https://html.spec.whatwg.org/multipage/links.html#links
Anchor tags are supported by https://github.com/killercup/link-checker and the implementation is done with a HashSet of anchor links per page. https://github.com/killercup/link-checker/blob/master/src/links.rs. Something to look into if we get around to adding it to lychee.
Sometimes anchors aren't even in the HTML itself, like for static websites! Another approach could be to make the request with and without the anchor and compare the scroll location, if that's possible; it means making two requests though.
> make the request with and without the anchor and compare the scroll location if that's possible
That requires emulation of webpage rendering and is outside the scope of this project.
Also, dynamic websites still send an HTML document with element IDs; or do you mean scrolling via JavaScript? Usually JavaScript is just used for better control of smooth scrolling, in addition to the still-present functional element IDs for clients with no JavaScript support.
If there is really no matching ID, then I agree that the required full browser features, including a JavaScript engine etc., are out of scope. Also, I can't imagine that anyone wants to wait 5-10 seconds, twice, for every checked link, as required to load all assets, render the page, execute JavaScript, finish all events etc. Furthermore, since many websites have no definite "all done" state, but load/show additional dynamic content (ads, cookie banners, notification banners, ...) random seconds after the relevant content has already been rendered, including content shift and a changing scroll position, it is practically impossible to do such a comparison reliably without a lot of false negatives and without giving a very long grace period (30 seconds?) to minimise the risk of still-expanding content.
It might be worth splitting up the scope here. For my use case (a technical specification), I have a bunch of local .md files. I'd like to see that the links within those files are valid: they go to a valid file (already checked), and to a valid target anchor among the spec files. Also, I'd like to see that the http/s URLs at least go to a valid place. As noted, it could be challenging to ensure an anchor is valid on a dynamic site. I'm less interested in whether the external HTML files have the right anchor; those are likely to move around, and not everyone tries to keep stable links.
Perhaps a way forward would be to have some way to determine which filetypes and schemes are subject to anchor checking.
In my case: filetype *.md and scheme file: only.
Other users, such as the original author, may make it broader, e.g. including http: and https:.
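Such a per-user policy could be expressed as a simple predicate (hypothetical names, just to illustrate the filetype/scheme filtering idea):

```rust
/// Hypothetical predicate: only check fragments for scheme/extension
/// combinations the user opted into, here `file:` links to `.md` files.
fn should_check_fragment(scheme: &str, path: &str) -> bool {
    scheme == "file" && path.to_ascii_lowercase().ends_with(".md")
}
```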
I've created https://github.com/lycheeverse/lychee/pull/1126 with an implementation of anchor/fragment checking for links to markdown files. Do comment with your thoughts. Is it roughly what you would like?
Yes, it seems solving this as a whole isn't so simple, but this solves a part of it.
How does it determine if a link follows the md format? Checking the PR, it seems to check if the link ends with .md.
Here is a rather verbose message with my thoughts.
For fragment checking, we are only concerned with where links are pointing to and there are five forms that come to mind:
1. `file:///foo/bar.md#frag`
2. `file:///foo/bar.html#frag`
3. `https://foo/bar.html#frag`
4. `https://foo/bar#frag`
5. `https://foo/bar.md#frag`
There are two file types, markdown and HTML, that people sound interested in checking fragments within. It would make sense to use the extension in the link to determine the file type, i.e. .md and .html. The exception is Case 4, where one could assume it is an HTML file because it is being accessed over HTTP.
These files can be accessed through the file system (file:) or HTTP (http:/https:).
I would like lychee to at least support Cases 1, 2 and 3, with separate options for markdown and HTML fragment checking. I don't have a strong opinion about Cases 4 and 5 at the moment.
I think HTML fragment checks should be opt-in, e.g. --include-html-fragments, due to false negatives on dynamic sites. I feel markdown fragment checks should be opt-out, e.g. --ignore-md-fragments.
Another decision that needs to be made is how we generate fragments from a markdown file. We could choose one method that we feel is the normal method or we could offer multiple methods. If we are offering multiple methods, we would have to make sure the configuration isn't overwhelming.
The most common explicit generation method appears to be heading attributes. These aren't implemented in the CommonMark spec, but are supported by pulldown-cmark and are used commonly, in python-markdown for example.
# Heading Attributes
## A Title {#a_title}
The above heading will be given `a_title` as an id.
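Extracting such an explicit id could be sketched as follows (assuming the trailing `{#...}` syntax shown above; this is not pulldown-cmark's actual API):

```rust
/// Sketch: pull an explicit id out of a trailing `{#...}` heading attribute.
/// Returns None when the heading has no such attribute.
fn heading_attribute_id(heading: &str) -> Option<&str> {
    let start = heading.rfind("{#")?;
    let rest = &heading[start + 2..];
    let end = rest.find('}')?;
    Some(&rest[..end])
}
```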
The most common automatic generation method seems to be what I call unique kebab case, where headings are kebab-cased (lower-cased, with whitespace replaced by -). When two headings produce the same kebab-case fragment, a number is appended to make it unique. This is used by GitHub, GitLab and mdBook.
# Unique Kebab Case
## The Title
The above heading will have `the-title` as an id.
## A Title
The above heading will have `a-title` as an id.
## A Title
The above heading will have `a-title-1` as an id.
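This generation scheme can be sketched roughly as follows, assuming GitHub-like rules (keep alphanumerics, `-` and `_`; map whitespace to `-`; drop everything else; suffix repeats with a counter). The exact rules are an assumption; forges differ, as discussed later. Input is the heading text without the leading `#` markers:

```rust
use std::collections::HashMap;

/// Sketch of "unique kebab case" fragment generation under the assumed
/// rules above; not lychee's or GitHub's actual implementation.
fn kebab_fragment(heading: &str, seen: &mut HashMap<String, usize>) -> String {
    let base: String = heading
        .trim()
        .chars()
        .filter_map(|c| {
            if c.is_alphanumeric() || c == '-' || c == '_' {
                Some(c.to_ascii_lowercase())
            } else if c.is_whitespace() {
                Some('-') // every whitespace character becomes a dash
            } else {
                None // punctuation is dropped entirely
            }
        })
        .collect();
    // Append a counter to de-duplicate repeated headings.
    let count = seen.entry(base.clone()).or_insert(0);
    let fragment = if *count == 0 {
        base.clone()
    } else {
        format!("{base}-{count}")
    };
    *count += 1;
    fragment
}
```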
If we want to provide a simple interface with only one generation approach, I think we should either use heading attributes and fall back to unique kebab case when not present, or use both unique kebab case and heading attributes (when present).
Note selectors spec:
> If an element has multiple ID attributes, all of them must be treated as IDs for that element for the purposes of the ID selector.
Summary of the Open Questions
- Which of the 5 link types we should support fragment checks for?
- Whether they should be opt-in or opt-out?
- Which method(s) of generating fragments from markdown should we implement? Should we support multiple? If so, how do we expose these options to the user?
What has been done so far?
In https://github.com/lycheeverse/lychee/pull/1126, I have just implemented checks for Case 1 links, i.e. file:///foo/bar.md#frag links pointing to files with .md extensions.
It is missing tests and is not ready to be merged. It currently offers multiple different fragment generators and doesn't support inline HTML within the markdown file.
Next Steps
To get this first PR in we need to answer questions 2 and 3. After this has been merged, I can start working on getting the additional link types (that we later agree on) working.
> how does it determine if a link follows the md format? checking the pr it seems to check if the link ends with .md
Yep, it just looks at the extension.
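In code, that check might look like the following sketch (I haven't verified this matches the PR's exact logic):

```rust
use std::path::Path;

/// Sketch: treat a link as markdown when its path component ends in `.md`,
/// ignoring any `#fragment` suffix.
fn is_markdown_link(url: &str) -> bool {
    let path = url.split('#').next().unwrap_or(url);
    Path::new(path)
        .extension()
        .map_or(false, |ext| ext.eq_ignore_ascii_case("md"))
}
```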
Thanks for the summary.
From my perspective, the answers to your open questions would be as follows:
> Which of the 5 link types we should support fragment checks for?
I agree that we should tackle 1-3. The other two cases seem to be outliers, which we could tackle later if needed.
> Whether they should be opt-in or opt-out?
If it turns out that fragment checking is robust enough, I would not mind enabling it by default. By "robust enough" I mean that the number of false positives (runs reporting a broken fragment which in fact works) is below a threshold of, say, 1%. I would not mind false negatives too much. So I'd say let's stay conservative and not report an error if we're uncertain. As for the flags, I'd vote for --include-fragments or --exclude-fragments, as this is the naming scheme for other inclusions/exclusions right now and follows the principle of least surprise. Whether we want to differentiate between e.g. --include-md-fragments and --include-html-fragments can be answered later. Ideally, I'd prefer a single option, but if one check is much more robust than the other, it might merit a separate flag.
> Which method(s) of generating fragments from markdown should we implement? Should we support multiple? If so, how do we expose these options to the user?
I'd keep it simple for now. We can always add another option later to specify the detection algorithm, e.g. --fragment-generation kebab-case.
> If we want to provide a simple interface with only one generation approach, I think we should either use heading attributes and fall back to unique kebab case when not present, or use both unique kebab case and heading attributes (when present).
Yes, fully agree. A combination of heading attributes and unique kebab case would work. We can use the examples from the Gitlab docs for our tests.
I think checking HTML files should be a higher priority, because the majority of the time it'll be Case 4, though I'm not sure how we'd do this; we'd probably need to parse the HTML and check for `<hn>` tags.
I agree this should be opt-in, because many websites don't support anchors, or they use anchors differently.
Another, possibly less efficient, approach would be to compare the response with and without the anchor to see if they're the same; this'd definitely be opt-in.
This isn't enough for websites/browsers that use anchors to scroll, though, which many of them do, so I think that'd require users to provide a browser API, and then we'd check whether it scrolled with the anchor or not.
> i think checking html files should be higher priority because majority of the time it'll be case 4
Yes, websites without a file extension are indeed a common use-case.
> though im not sure how we'd do this, we'd probably need to parse the html and check for `<hn>` tags
We already parse HTML, and we have access to all tags thanks to html5ever and html5gum, so fetching the `<hn>` tags won't be an issue.
> i agree this should be opt-in because many websites dont support anchors, or they use anchors differently
AFAIK, anchors are mainly a browser feature. They are not even sent to the website and are only handled client-side. You can confirm that by sending a request with an anchor to any website; it will be stripped out before the request.
Compare the URL in the address bar with the URL in the network tab.
JavaScript might also handle anchors, as it has access to the address bar. This is mainly used for dynamic websites, I guess.
No browser API is needed, as it's enough to know that the anchor exists on the website that the browser tries to render. We don't have to emulate the actual rendering. Just the anchor id generation.
There's a chance I misunderstood your point.
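For illustration, stripping the fragment before issuing the request is essentially a one-liner (a sketch, not lychee's actual code):

```rust
/// Sketch: drop the client-side `#fragment` part of a URL, mirroring what
/// browsers do before sending the request.
fn strip_fragment(url: &str) -> &str {
    url.split('#').next().unwrap_or(url)
}
```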
I meant that, mainly for SPAs or SSR, some websites use anchors kind of like paths or queries; for example, example.com#anchor is like example.com/anchor. I think this is done to keep the state of which "page" the user is on, so they can show it on reload, or when going back, etc.
These sites are a minority though, so we could skip over this if it's too much work (it probably is).
Ah, you got it. Thanks for the explanation. Yes, SPAs will be tricky for now. We might not be able to check these. Let's start with the simple use-case of static / server side rendered pages, which will already be a great step forward.
I was wrong about GitLab and GitHub using the same fragment generation. I've created some test cases to compare common forges:
| Heading | Forgejo | GitHub | GitLab |
|---|---|---|---|
| `# Repeated Heading` | repeated-heading | repeated-heading | repeated-heading |
| `## This header has a :thumbsup: in it` | this-header-has-a-thumbsup-in-it | this-header-has-a-thumbsup-in-it | this-header-has-a-thumbsup-in-it |
| `# header with 한글 characters (using unicode)` | header-with-한글-characters-using-unicode | header-with-한글-characters-using-unicode | header-with-한글-characters-using-unicode |
| `### Repeated Heading` | repeated-heading-1 | repeated-heading-1 | repeated-heading-1 |
| `## Repeated Heading` | repeated-heading-2 | repeated-heading-2 | repeated-heading-2 |
| `## Underscores foo_bar_, dots . and numbers 1.7e-3` | underscores-foo_bar_-dots-and-numbers-1-7e-3 | underscores-foo_bar_-dots--and-numbers-17e-3 | underscores-foo_bar_-dots-and-numbers-17e-3 |
| `## Many spaces` | many-spaces | many----------spaces | many-spaces |
NB: GitLab doesn't appear to follow its own GitLab-flavoured markdown documentation. Specifically, it doesn't actually collapse shortcodes, and 1.7 doesn't become 1-7 but 17.
The major difference appears to be that GitLab and Forgejo both collapse all non-alphanumeric characters into a single dash, whereas GitHub converts whitespace characters to dashes and ignores all others. However, GitLab appears to ignore the . in 1.7e-3, like GitHub, whereas Forgejo doesn't.
mdBook seems to follow GitHub's fragment generation. I suggest we do the same, as GitHub is the most popular forge. The option remains to add additional generators in the config in the future.
Perfect. Great analysis! Maybe you want to start a new crate (anchorfy?) which allows you to create the fragment from different flavors?
`assert_eq!(anchorfy::Github("## Many spaces"), "many----------spaces".to_string());`
`assert_eq!(anchorfy::Gitlab("## Many spaces"), "many-spaces".to_string());`
For now, GitHub/mdbook would be enough.
How would we handle these outside of the popular sites? Should we provide a way for the user to configure this (probably the same way we pre-configure these websites)?
I'd assume >90% of all Markdown files we check are on GitHub. If we make it configurable, we should keep it simple. Maybe we just introduce a --fragment-format option at some point, which accepts github, gitlab, etc. as an argument. For the initial version, let's go with GitHub and wait for feature requests. Otherwise, the PR becomes too big. What do you think?
That's fair enough, but another major use case would be linking to documentation, for example Swagger.
I think we should design the header format in an independent way, so that in the future we can let others set it more easily.
Sorry, I don't quite get it. If we find a fragment link in a document, we have to check if that fragment exists. Say we find "#helloworld" and there is a headline named "Hello World", for which a fragment was generated automatically by the hoster (GitHub, GitLab, Swagger (?)). Now the question is if that headline gets converted into "#helloworld" or "#hello-world" or something else. How can we design the header format in such a case? Don't we have to generate the same fragment link as the hoster?
It's likely I might be missing something here.
> Don't we have to generate the same fragment link as the hoster
Yes, that's what I mean. The exact implementation would need a balance between verbosity and conciseness, but for example something like
`[ { " ": "-" } ]`
would replace `" "` (a space) in the content with `-` in the header, transforming hello world into hello-world. We'd likely need more configuration, like escaping, casing, handling duplicates and so on.
Ah, I understand now. Not sure how to find the right balance between simplicity and flexibility here. Open for suggestions for the configuration format.
We could use TOML; I believe it supports arbitrary keys as strings ("abc" = "xyz"), which could be used for replacing characters.
We could take an "enum" (not sure TOML supports them) for casing; options I can think of are camel case, snake case, kebab case, pascal case and macro case.
For escaping, I think we already do percent encoding, so I'm not sure if there's additional configuration necessary.
Lastly, duplicates are a little tricky. We could have users provide a format template such as {anchor}-{n}, where n is the index of the occurrence. We'd also have to let them set whether it's 0-indexed or 1-indexed, and whether the first occurrence still needs this formatting (whether to use {anchor} or {anchor}-{0} for the first occurrence).
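Putting those ideas together, a hypothetical TOML shape could look like this (every key here is invented for illustration; none of this is an existing lychee option):

```toml
# Hypothetical fragment-generation config; all keys are invented.
[fragment]
casing = "kebab-case"         # e.g. camel, snake, kebab, pascal, macro
duplicates = "{anchor}-{n}"   # template for repeated anchors
duplicate-index-base = 1      # whether occurrences are 0- or 1-indexed

[fragment.replace]
" " = "-"                     # replace spaces with dashes
```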
Or we could take an entirely different approach and direct users to use the crate for configuration, maybe having them pass a closure like |element| -> anchor, which is infinitely extensible, but may be a dealbreaker for non-Rust users or people that don't want to spend the time on that.