lychee Support for AsciiDoc (.adoc) files

Would it be possible to add support for checking links in AsciiDoc (*.adoc) files? It seems that lychee is currently unable to correctly parse asciidoc-style URLs and reports e.g.

✗ https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer] (HTTP status client error (404 Not Found) for url (https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer]))

when obviously the URL it should be checking is https://www.kernel.org/doc/Documentation/fb/framebuffer.txt

The reason I ask is that we're switching our documentation repo from markdown-source to asciidoc-source, and it would be nice to be able to run https://github.com/marketplace/actions/lychee-broken-link-checker directly against the asciidoc files.

Aug 02 '21 09:08 lurch

there are a few differences between asciidoc and asciidoctor, even though the later claimed to be another implementation of the former.

We should support both.

Aug 04 '21 03:08 lebensterben

Are there any good crates for AsciiDoc yet? Couldn't find any.

Aug 20 '21 09:08 mre

we can pandoc it to html

Aug 20 '21 11:08 lebensterben

Smart. Would we have to wrap the pandoc commandline tool for that?

Aug 20 '21 12:08 mre

@mre

First make it a optional feature, say "pandoc: Adds support for more file formats with help of 'pandoc'".

If the feature is enabled, then also package a binary release, which would not violate pandoc's GPL v2 license. And use this library in our code https://lib.rs/crates/pandoc

If the feature isn't enabled, but the user provided some known file formats that are only supported by pandoc, it should recommend them to use the optional 'pandoc' feature.

Aug 20 '21 16:08 lebensterben

That's cool. Yeah let's do this.

Aug 20 '21 17:08 mre

How would we package the pandoc binary? In the Docker image it's pretty easy, but what about binary? Is there any build.rs file we could use as an example? I'd argue that most people would use the Docker image for that or install pandoc themselves. We could add a check if it's in the PATH on startup.

Jan 13 '22 00:01 mre

@mre You can for example write a shell script to grab the latest release from their repository.

Jan 13 '22 00:01 lebensterben

Just so we don't forget, there's also asciidoctor, which could be another binary to search for when trying to convert AsciiDoc. Here's an example for how to convert AsciiDoc to HTML: https://github.com/jeremyandrews/cio/blob/5c98cda6e31219e5063d591062086bbc370adbea/cio/src/rfds.rs#L95

Nov 10 '22 22:11 mre

I guess the downside of converting the AsciiDoc to HTML first, and then link-checking the HTML, is that any broken-link errors will be reported against the HTML output filenames, and not the original AsciiDoc input filenames? :thinking: :man_shrugging: This will especially be a problem if the AsciiDoc sourcefiles make extensive use of the include directive (which we use a lot in https://github.com/raspberrypi/documentation/ ) .

Nov 11 '22 00:11 lurch

I see. That's a problem indeed. However we could track files that got converted in the process and print the original filenames instead. (We don't support printing the location of broken links within files so it should be fine.) If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

Nov 11 '22 01:11 mre

If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

Consider if topfile.adoc includes file1.adoc and file2.adoc - when you convert to HTML this will get flattened into a single topfile.html. So if there was a broken link in file1.adoc, I guess your proposed solution would actually report it as being inside topfile.adoc, which wouldn't be right. (and obviously the more includes you use, see e.g. https://raw.githubusercontent.com/raspberrypi/documentation/develop/documentation/asciidoc/computers/configuration.adoc , the worse this problem would become)

So perhaps in hindsight, the original request was outside the scope of what lychee could/should be able to do? :shrug:

Nov 11 '22 01:11 lurch

You're right. That won't work. I would really love to support AsciiDoc, but there are no Rust parsers that I'm aware of. Link checking would be very rudimentary at best without that. 😞

Nov 11 '22 10:11 mre

You could just use some regexes? Probably no need to parse the whole thing "properly"?

Nov 11 '22 11:11 lurch

Honestly I'm scared of the edge-cases.

A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience: It's tedious for a user to filter out unwanted matches instead of missing a valid link.

Let me show you what I mean with an example.

To match links, the easiest case would be to start with a regular expression like

(.+)\[.*?\]

This would work fine in many situations like

https://chat.asciidoc.org[Discuss AsciiDoc,role=external,window=_blank]
https://chat.asciidoc.org[Discuss AsciiDoc^]
https://example.org/?q=%5Ba%20b%5D[URL with special characters]

but it would also match

[[bookmark-a]]Inline anchors make arbitrary content referenceable.

So one might think: "let's just ignore links with spaces"!

One could try

([^\s]+)\[.*?\]

This would work, but it would break in the following case:

link:++https://example.org/?q=[a b]++[URL with special characters]

Even if we could match on the last occurrence of square brackets before a string (which wouldn't be pretty), it would still incorrectly match includes:

include::filename.txt[tag=definition]

To handle this case we'd need negative lookaround, which is not supported by the regex crate and even if we did that with some post-processing ( s.trim_start_matches("include::"); it would not be very maintainable.

And these examples are just taken from the AsciiDoc quick reference. It's likely that there might be more edge-cases in real world examples.

Unless I'm missing something the best option might be to wait for a crate with proper parsing support or use an external tool and explicitly drop support for includes.

Nov 11 '22 13:11 mre

Honestly I'm scared of the edge-cases. A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience

Yeah, that's fair enough. Edge-cases always turn out to be a lot hairier than you expect :laughing:

If it makes any difference, I was only expecting lychee to check external links (e.g. those starting with http:// or https://) rather than also checking internal cross-references; but even that would probably fall foul of the edge-cases you've identified.

Feel free to close this issue if you think that it's out of scope for this project.

Nov 11 '22 14:11 lurch

Let's keep it open for the time being because I'd still love to support it. Maybe there will be an AsciiDoc crate in the future.

Nov 11 '22 15:11 mre