
Support anchor link targets

Open becheran opened this issue 5 years ago • 61 comments

Anchor links

The part after the #, called the anchor link, is currently not checked. A markdown link including an anchor link target is, for example, [anchor](#go-to-anchor-on-same-page) or [external](http://external.com#headline-1).

How do anchor links work

HTML defines anchor targets via the anchor tag (e.g. <a id="generator"></a>). An anchor target can also be any HTML tag with an id attribute (e.g. <div id="foo-bar"></div>).

The official markdown spec does not define anchor targets, but most interpreters and renderers support the generation of default anchor targets in markdown documents. For example, the GitHub markdown flavor supports auto-generated link targets for all headlines (h1 to h6) with the following rules:

  1. downcase the headline
  2. remove anything that is not a letter, number, space or hyphen
  3. change any space to a hyphen
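Sketched in Rust (mlc's own language), the three rules above might look like this. This is a simplification: the exact GFM algorithm also handles Unicode edge cases and de-duplicates repeated slugs, which this skips.

```rust
/// Simplified GitHub-style slug generation, following the three rules above.
/// The real GFM algorithm handles more cases (e.g. duplicate headlines).
fn gfm_slug(headline: &str) -> String {
    headline
        .to_lowercase() // rule 1: downcase the headline
        .chars()
        // rule 2: keep only letters, numbers, spaces and hyphens
        .filter(|c| c.is_alphanumeric() || *c == ' ' || *c == '-')
        // rule 3: change any space to a hyphen
        .map(|c| if c == ' ' { '-' } else { c })
        .collect()
}
```

With this, a headline like "Go to Anchor, on same page!" yields the slug go-to-anchor-on-same-page.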

Implementation hints

A good first step would be to add example markdown files with valid anchor links to the benches dir; these will be used for the [end-to-end unit tests](./tests/end_to_end.rs).

The library's run method is the most important one; it uses all submodules and does the actual execution.

In the link extractor module the part after the # needs to be extracted and saved in the MarkupLink struct.

The link validator module is responsible for the actual resolving and checks whether a resource exists (either on disk or as a URL). This code needs to be enhanced to not only check for existence when an anchor link was extracted, but to also actually parse the target file and extract all anchor targets. The same must be done for web links. Right now a HEAD request is sent, and only if that fails a GET is sent. If an anchor link needs to be followed, a GET request is needed and the resulting page needs to be parsed for all anchors.

Besides the already existing grouping of identical links, which are only checked once as a performance boost, it would also make sense to parse a document which is the target of an anchor link only once and reuse the parse result for other references to the same doc. Also for performance reasons, it would be great to only download and parse documents which actually have an anchor link pointing to them, and not all docs for all links.
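The reuse idea could be a simple in-memory map from document to its set of anchor targets. This is only a hedged sketch; AnchorCache and its API are made-up names for illustration, not existing mlc code:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory cache: each target document is parsed for its
/// anchor targets at most once, and the result is reused for every other
/// link pointing at the same document.
struct AnchorCache {
    anchors_by_doc: HashMap<String, HashSet<String>>,
}

impl AnchorCache {
    fn new() -> Self {
        Self { anchors_by_doc: HashMap::new() }
    }

    /// `parse` is only invoked the first time `doc` is seen; later lookups
    /// hit the cache. Documents nobody links into are never parsed at all.
    fn has_anchor<F>(&mut self, doc: &str, anchor: &str, parse: F) -> bool
    where
        F: FnOnce(&str) -> HashSet<String>,
    {
        self.anchors_by_doc
            .entry(doc.to_string())
            .or_insert_with(|| parse(doc))
            .contains(anchor)
    }
}
```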

becheran avatar Apr 25 '21 19:04 becheran

@hoijui if you want to contribute and help improve mlc, that would be awesome of course. Hope the tool actually does what you want it to do. I created it with my (and others') GitHub READMEs and docs in mind, to eliminate broken links there.

cool! :-) My use case is the same; mainly for docs of open source hardware projects.

You are right.. no file-system caching of downloaded files is necessary. Scanning for references once downloaded and caching that in memory makes more sense... everything you write here sounds good.

PS: The links in your issue do not work. Looks like you used relative links, and instead of going to the sources, they resolve directly against this URL (.../issues/). :/

Here are the (absolute) working links:

  • https://github.com/becheran/mlc/tree/master/benches/benchmark
  • https://github.com/becheran/mlc/blob/master/tests/end_to_end.rs
  • https://github.com/becheran/mlc/blob/master/src/lib.rs
  • https://github.com/becheran/mlc/tree/master/src/link_extractors
  • https://github.com/becheran/mlc/tree/master/src/link_validator

Thanks a lot! great issue.. I hope I get to look into it these days (I think I will :-) ).

hoijui avatar Apr 26 '21 17:04 hoijui

A question ... in the benchmark directory, there are some markdown files (directly in that dir), and then there is a markdown and an html directory, with markdown and html files respectively. why are there markdown files in the first level? or say.. should I just put new files just under the markdown and html dirs?

hoijui avatar Apr 26 '21 18:04 hoijui

PS: The links in your issue do not work. Looks like you used relative links, and instead of going to the sources, they resolve directly against this URL (.../issues/). :/

Here are the (absolute) working links:

  • https://github.com/becheran/mlc/tree/master/benches/benchmark
  • https://github.com/becheran/mlc/blob/master/tests/end_to_end.rs
  • https://github.com/becheran/mlc/blob/master/src/lib.rs
  • https://github.com/becheran/mlc/tree/master/src/link_extractors
  • https://github.com/becheran/mlc/tree/master/src/link_validator

Thanks! Don't know what I thought when I created those links. Broken links in a link checker tool issue... 🤦‍♂️

becheran avatar Apr 26 '21 19:04 becheran

A question ... in the benchmark directory, there are some markdown files (directly in that dir), and then there is a markdown and an html directory, with markdown and html files respectively. why are there markdown files in the first level? or say.. should I just put new files just under the markdown and html dirs?

Good question! Now that I look at those markdown files, I wonder why I did not put them in their corresponding dirs in the first place. Would be great if you add tests for anchor links in the markdown and html dirs. I will clean up here.

One more thing: there is already a warn message in https://github.com/becheran/mlc/tree/master/src/link_validator/file_system.rs line 65, where I print a message that everything after # is ignored right now. So currently links with an anchor will not make mlc fail, but will print a log message instead.

becheran avatar Apr 26 '21 19:04 becheran

a question ...

You have code scanning Markdown files for links (src/markdown_link_extractor.rs) and code scanning HTML files for links (src/link_extractors/html_link_extractor.rs).

Now we need code scanning for anchor targets in Markdown and HTML too. Should this code be integrated into the link scanner code, or should it be separate? If it gets integrated, we can make the code so one can select to only scan for links, only for anchor targets, or for both. This way, we need to scan each file only once. If we separate the code, and scan only for one of the two during each scan, it will probably be less performant, and the code will probably be bigger, overall, and more complicated; right? So I suppose, integration is the way to go?
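To make the trade-off concrete, the integrated single-pass idea could be expressed with a selector like this. A toy sketch only: ScanFor, scan, and the line-based matching are all invented for illustration, not mlc code.

```rust
/// Hypothetical selector: one pass over a file can collect links,
/// anchor targets, or both, so each file is read only once.
#[derive(Clone, Copy, PartialEq)]
enum ScanFor {
    Links,
    Anchors,
    Both,
}

struct ScanResult {
    links: Vec<String>,
    anchors: Vec<String>,
}

/// Toy single-pass scan: "# " headings become anchor targets, "(#...)"
/// spans become links. The real extractors are of course far more involved.
fn scan(doc: &str, what: ScanFor) -> ScanResult {
    let mut result = ScanResult { links: Vec::new(), anchors: Vec::new() };
    for line in doc.lines() {
        if what != ScanFor::Links {
            if let Some(heading) = line.strip_prefix("# ") {
                result.anchors.push(heading.to_lowercase().replace(' ', "-"));
            }
        }
        if what != ScanFor::Anchors {
            if let Some(rest) = line.split("(#").nth(1) {
                if let Some(link) = rest.split(')').next() {
                    result.links.push(link.to_string());
                }
            }
        }
    }
    result
}
```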

Another question: should we make the anchor checking optional? (I would say yes, because.. why not?)

hoijui avatar Apr 27 '21 12:04 hoijui

So I suppose, integration is the way to go?

Wow, just had a look at my parser code again; I have to admit it is a bit hard to read with all the conditions...

But in the end, both extractors push a new link to the result vector at one or several points, for example line 159 in markdown_link_extractor.rs. Instead of setting the target directly there, it might make sense to call another static function which does the splitting of anchor and target and adds this to the MarkupLink struct, since this would be the same for the html and markdown link extractors, or not? Maybe add the anchor as another optional field of the MarkupLink struct, such as pub anchor: Option<String>?
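The shared splitting helper could look roughly like this. A sketch only: the fields of MarkupLink other than anchor are guesses at the existing struct, and split_anchor is an invented name.

```rust
/// Hypothetical shape of the extended struct, with the anchor as an
/// optional extra field as suggested above (other fields are guesses).
struct MarkupLink {
    target: String,
    anchor: Option<String>,
    line: usize,
    column: usize,
}

/// Shared helper for both extractors: split a raw link target at the
/// first `#` into the target proper and an optional anchor.
fn split_anchor(raw: &str) -> (String, Option<String>) {
    match raw.find('#') {
        Some(pos) => (raw[..pos].to_string(), Some(raw[pos + 1..].to_string())),
        None => (raw.to_string(), None),
    }
}
```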

The actual anchor check and file load/download should happen somewhere in the link_validator, I would say (without having a 100% clear strategy yet).

Another question: should we make the anchor checking optional? (I would say yes, because.. why not?)

Yes. Though I would prefer it to be enabled by default, with the option to disable it via a command line arg such as --no-anchors.

becheran avatar Apr 27 '21 18:04 becheran

Wow, just had a look at my parser code again; I have to admit it is a bit hard to read with all the conditions...

Speaking of hard to read... I am trying to get rid of the somewhat strange-looking and not so performant throttle code in lib.rs on a separate branch.

becheran avatar Apr 27 '21 18:04 becheran

Wow, just had a look at my parser code again; I have to admit it is a bit hard to read with all the conditions...

Speaking of hard to read... I am trying to get rid of the somewhat strange-looking and not so performant throttle code in lib.rs on a separate branch.

ok.. I do not understand that, but.. good luck! :-) and.. does that mean, that right now I am able to work on the parts I need to change (mainly link_validator stuff), without causing many merge conflicts for you (or me) later?

hoijui avatar Apr 28 '21 08:04 hoijui

and.. does that mean, that right now I am able to work on the parts I need to change (mainly link_validator stuff), without causing many merge conflicts for you (or me) later?

You can work on that part right now. My changes should be far away in lib.rs and if we need to merge I will do that. No worries

becheran avatar Apr 28 '21 10:04 becheran

For example, the GitHub markdown flavor supports auto-generated link targets for all headlines (h1 to h6) with the following rules:

  1. downcase the headline
  2. remove anything that is not a letter, number, space or hyphen
  3. change any space to a hyphen

As you write, this is the case for GFM, but I just found out that Pandoc Markdown does it differently. :/ It strips any number at the beginning of the title. This means that for this anchor checking we have to allow the user to choose which Markdown flavor they are using, and we have to support the different algorithms of these flavors, I guess.
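For comparison, here is a rough sketch of the Pandoc behaviour next to the GFM one. This is simplified from the Pandoc manual's auto-identifier rules (lowercase, spaces to hyphens, drop everything up to the first letter, fall back to "section"); the real algorithm differs in more details, e.g. which punctuation is kept.

```rust
/// Simplified Pandoc-style identifier generation. Unlike GFM, everything
/// before the first letter is stripped, so "1. Introduction" becomes
/// "introduction" rather than "1-introduction".
fn pandoc_slug(headline: &str) -> String {
    let base: String = headline
        .to_lowercase()
        .chars()
        // Pandoc keeps letters, numbers, underscores, hyphens and periods
        .filter(|c| c.is_alphanumeric() || matches!(*c, ' ' | '-' | '_' | '.'))
        .map(|c| if c == ' ' { '-' } else { c })
        .collect();
    // Drop everything up to the first letter.
    let trimmed: String = base.chars().skip_while(|c| !c.is_alphabetic()).collect();
    if trimmed.is_empty() {
        "section".to_string() // Pandoc's fallback for empty identifiers
    } else {
        trimmed
    }
}
```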

hoijui avatar May 12 '21 09:05 hoijui

That's the problem of not having a md standard. Every implementation can do it differently :-(

I guess there is no way around having a command line option. But what would be your most relevant use case? Personally, I use mlc mainly for GitHub README checks, so I would love to have GitHub style support.

Just re-visited issue #19 where someone states:

Yeah, this would be very nice. I use those with GitLab docs extensively. Sometimes, I even use custom anchors:

### This is a very long header {#long-header}

Let's [link to it](#long-header)

Never saw this {#long-header} syntax for headline anchors before. So this might be a third style only valid for GitLab markdown...

becheran avatar May 12 '21 17:05 becheran

@becheran It was me writing about the {#id} syntax. I use it when generating documents from Markdown using pandoc, the relevant docs are here, look for Heading identifiers. I might have been wrong about GitLab supporting it, sorry about that. Pandoc docs say the same syntax works with PHP Markdown extra, see here. So, it may be worth considering configuring support for this syntax not on a Markdown variant basis, but just with a switch enabling this extended support (imagine something like --enable-special-attributes). Just an idea, though.

dvtomas avatar May 12 '21 18:05 dvtomas

My main use cases are GitHub, GitLab and Pandoc too. I also use the {#id} syntax (with pandoc). It is quite useful for when you have documents of this structure:

## Foo

### Intro {#foo-intro}

## Bar

### Intro {#bar-intro}

Without that syntax, the anchors would just be enumerated (#intro and #intro-1, for example), which of course is cryptic and subject to change, when the document structure changes.
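The enumeration described above can be sketched as a counter per base slug. Simplified: real renderers also have to handle the case where, e.g., an explicit intro-1 heading collides with a generated one.

```rust
use std::collections::HashMap;

/// Make duplicate slugs unique the way GitHub/GitLab renderers do:
/// the first "intro" stays "intro", the next becomes "intro-1", etc.
fn dedup_slug(slug: &str, seen: &mut HashMap<String, usize>) -> String {
    let count = seen.entry(slug.to_string()).or_insert(0);
    let result = if *count == 0 {
        slug.to_string()
    } else {
        format!("{}-{}", slug, *count)
    };
    *count += 1;
    result
}
```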

hoijui avatar May 12 '21 19:05 hoijui

.. I am not dead! still on it, although sporadically. today I did some stuff again, and for the first time, felt somewhat comfortable with using rust.. yeay! somehow I feel, it is partly because of your code. It is not "sophisticated" in a bad way... thanks ! :-) still no commits to see yet, but I am mostly done with the benchmark files (at least in Markdown), and I am about 30% through with a first, basic implementation in the code.

hoijui avatar May 31 '21 06:05 hoijui

@hoijui take all the time you want. There are no deadlines. We do this for fun. At least this is how I treat coding in my spare time. It should be at least as enjoyable as watching a Netflix series 😛

If you need/want any help, plz let me know. Rust rocks! 🦀

becheran avatar May 31 '21 20:05 becheran

:-) thank you @becheran ! It feels good to hear that.

And yes.. rust can be fun, I learned now, cool! :-) I am a bit unsure how to go about the parser part. It needs to be extended with parsing for the anchors (those auto-generated from titles, and the manually set, pandoc-markdown style ones). When reading the parser part, it looks a bit.. very verbose.. is this really the best way to do it, comparing the upcoming chars like this:

            } else if line_chars.get(column) == Some(&'<')
                && line_chars.get(column + 1) == Some(&'/')
                && line_chars.get(column + 2) == Some(&'s')
                && line_chars.get(column + 3) == Some(&'c')
                && line_chars.get(column + 4) == Some(&'r')
                && line_chars.get(column + 5) == Some(&'i')
                && line_chars.get(column + 6) == Some(&'p')
                && line_chars.get(column + 7) == Some(&'t')
                && line_chars.get(column + 8) == Some(&'>')
            {

obviously, I am way too newb to decide, it just looks too verbose to me. .. maybe a helper function that takes line_chars and "</script>" or another pattern to compare against? or can this not be made as performant?
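The helper asked about here could indeed be a small function; a sketch (the name matches_at is invented):

```rust
/// Check whether `pattern` occurs in `line_chars` starting at `column`.
/// Replaces the long `line_chars.get(column + n) == Some(&'x')` cascades.
fn matches_at(line_chars: &[char], column: usize, pattern: &str) -> bool {
    pattern
        .chars()
        .enumerate()
        .all(|(i, p)| line_chars.get(column + i) == Some(&p))
}
```

The nine-line </script> cascade above would then collapse to matches_at(&line_chars, column, "</script>"), and out-of-bounds access stays safe thanks to get.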

That is a minor thing though, and I am not much concerned about it, but I am about this: I do not feel confident that I could.. manually edit this parsing code and not produce lots of bugs (wrongly parsed edge-cases). How did you come up with this code? For example: the markdown parsing code seems to skip ATX-style titles. So links in titles are not recognized by mlc, but CommonMark supports them, and everything else I tried so far did too.

That then makes me ask... should we use an external parser, a library for that? It seems that there are not many rust markdown parsers, and they seem to support only CommonMark and GFM, but I would also need pandoc markdown (for the manually set anchors). There is also a rust library interfacing with pandoc (the binary), to produce pandoc ASTs from rust, which can then be used in rust. Pandoc supports GFM, CommonMark, Pandoc Markdown, HTML, reStructuredText, and MANY more input formats! It just has two major downsides:

  • it is not rust, but we would be calling an external binary (written in Haskell)
  • Pandoc does not give source line and column of the parsed parts

The second issue there makes it practically unusable. :/ ... except.. if we venture into pandoc development, majorly! Coincidentally, I know some of the pandoc-internal issues which would make this a huge task: it is made up of many parts (obvious, when looking at all the languages it supports) that all interconnect with each other through its AST. Pandoc supports no versioning or other modularity meta-data for an AST as of now, which means.. every change that affects the AST requires EVERYTHING to be modified. This includes all the parsing and serializing modules, and, even much worse, all the filters written by everyone who uses pandoc. And this is by far the best document conversion tool out there (in terms of spread and functionality). There is basically no alternative to it.

... Yet another approach would be to use a general parsing library (lexer, PEG, LR, LL, ... I don't know what all that means ;-) ).

I did not check for HTML, but it being much more standardized, more widely spread and so on, I bet there would be one or two good rust HTML parsers that could be used, though even editing your code for that looks easy enough for me to do it.

But yeah... I am a bit frustrated with the Markdown issue. Of course it all comes down in the end, to markdown not being one standard (plus pandocs internal issues).

... breathing again

hoijui avatar Jun 01 '21 06:06 hoijui

The links-in-title thing is something I came across purely by accident, and I know I would have missed much more if I had written the parser; I just mentioned it because ... inelegantly put, it made me feel my fear of this being complex is valid.

hoijui avatar Jun 01 '21 08:06 hoijui

I had some bad days. am better now... ignore the above! ;-)

hoijui avatar Jun 03 '21 06:06 hoijui

@hoijui sorry for late reply.

You are right. The parser part is nothing to be proud of. It was kind of the first rust code that I wrote, and it looks pretty ugly. I am sorry. My thoughts when I started this were:

  • I need to extract links from markdown with keeping the line and column
  • There is no lib out there which does the job
  • I came up with this if-cascade/match-case parser monster

Another issue with this code is that I did not know about the very well documented CommonMark and GitHub Flavored Markdown specs. Should have read them before. Anyways... too late, and it does work, except for edge cases (aka. bugs) when it doesn't :-P.

becheran avatar Jun 03 '21 19:06 becheran

The links-in-title thing is something I came about purely by accident, and I know I would have missed much more if I had written the parser; I just mentioned it because ... unelegantly put, it made me feel my fear of this being complex is valid.

Is it a bug? I wonder what I thought when I wrote the link_in_headline unit test?? Was it not possible to have links in headlines on GitHub before? Did they change the markdown parser? Can't reconstruct this anymore. Fact is that one can have links in headlines even on GitHub. Issue #35

becheran avatar Jun 03 '21 19:06 becheran

OK. And for the future of this issue...

The usage of pandoc unfortunately seems not to be an option, due to the issue that you pointed out. Without the source file metadata, the parser is not really practical here. :-(

So I assume there is no real way around writing our own parser? Just did a quick search and found this lib, which looks as if it does a good job of extracting tokens. I still believe that extracting links and headlines from markdown flavors should be a solvable problem.

What do you think @hoijui? Do you think it makes sense if I re-write the parser using the logos lexer?

becheran avatar Jun 03 '21 20:06 becheran

Hi all,

I recently wrote a very simple, quick-and-dirty intra-link check for my markdown documents. I'm not sure if it's of any utility for you, but just maybe.

Here's the code. The Itertools dependency is there just for the join method and can easily be gotten rid of.



use std::collections::HashSet;
use itertools::Itertools;

/// Checks that each `[Foo](#foo)` link has a corresponding `# Foo header {#foo}` header
fn check_markdown_anchor_links(document: &str) -> Result<(), String> {
    fn words<'a>(s: &'a str, starting_with: &str, ending_with: char) -> HashSet<&'a str> {
        s.split(starting_with).skip(1).filter_map(|s| s.split_once(ending_with).map(|s| s.0)).collect()
    }

    let anchors = words(document, "{#", '}');
    let links = words(document, "(#", ')');
    let links_without_anchors = links.difference(&anchors).map(|link| *link).collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [](#) intra-document links don't have a corresponding {{#}} header anchor: {}", links_without_anchors.join(",")))
    }
}

/// Suppose that the file contains [Internal link to headers][] as per https://pandoc.org/MANUAL.html#extension-implicit_header_references.
/// Check that all such internal links refer to existing headers. See the test.
fn check_markdown_implicit_header_references(document: &str) -> Result<(), String> {
    let anchors: HashSet<&str> = {
        document.lines().filter_map(|l|
            if l.starts_with('#') {
                Some(l.trim_start_matches('#').trim())
            } else {
                None
            }
        ).collect()
    };

    let links: HashSet<&str> = {
        let ending = "][]";
        document
            .split_inclusive(ending)
            .filter_map(|s| if s.ends_with(ending) {
                s.trim_end_matches(ending).rsplit_once('[').map(|s| s.1)
            } else {
                None
            })
            .collect()
    };

    let links_without_anchors = links.difference(&anchors).map(|link| *link).collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [...][] intra-document links don't have a corresponding # header: {}", links_without_anchors.join(",")))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_check_markdown_implicit_header_references() {
        let document = "\
# Header
This is a link to [Header][]
This is [Ignored, no trailing empty square brackets]
This is a [link to a non-existing anchor][]
Another [Ignored, this time at the end of the document]
        ";
        assert_eq!(
            &check_markdown_implicit_header_references(document).unwrap_err(),
            "The following [...][] intra-document links don't have a corresponding # header: link to a non-existing anchor"
        )
    }

    #[test]
    fn test_check_markdown_anchor_links() {
        let document = "\
# Header {#header}
This is a [link to the header](#header)
This is a [link to a non-existing anchor](#non-existing)
This is [link to a non-existing anchor](#non-existing)
        ";
        assert_eq!(
            &check_markdown_anchor_links(document).unwrap_err(),
            "The following [](#) intra-document links don't have a corresponding {#} header anchor: non-existing"
        )
    }
}

dvtomas avatar Jun 04 '21 05:06 dvtomas

:D :D wow.. thank you @dvtomas .. it's like magic that you ended up here with this right now! :-) If I read the code correctly, you are not doing structural parsing of the Markdown, but rather ... scanning for certain strings. So for example, you would recognize links and titles even when they are within code blocks, right?

hoijui avatar Jun 04 '21 06:06 hoijui

OK. And for the future of this issue...

The usage of pandoc unfortunately seems not to be an option, due to the issue that you pointed out. Without the source file metadata, the parser is not really practical here. :-(

So I assume there is no real way around writing our own parser? Just did a quick search and found this lib, which looks as if it does a good job of extracting tokens. I still believe that extracting links and headlines from markdown flavors should be a solvable problem.

What do you think @hoijui? Do you think it makes sense if I re-write the parser using the logos lexer?

Sounds good! Do you have the time and leisure for that? :-) I have no practical experience with anything regarding parsing/lexers and such... but if you think this one is good... good!

I guess I'll try to wrap up what I have so far into commits, even though it is of course not functioning without the parsing part, and.. later retrofit it to your changes... ?

hoijui avatar Jun 04 '21 06:06 hoijui

@hoijui No magic, I've been watching the issue because I'm also interested in that :-)

You're right about the code just looking for opening and closing sequences, regardless of context. Adding support for code blocks shouldn't be too hard, though. I can imagine parsing the document line by line in a similar fashion to what my code does, and adding a simple state indicator inside_block that would be toggled on and off when a triple backtick is encountered (and probably ignoring the line if it starts with a four-space indent, which is another way of expressing a code block). I'm not saying that it's the best solution, but it is a simple one. Tests on real-world documents would probably reveal if it is viable.
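That inside_block state machine might be sketched like this. It implements only the suggestion above and deliberately ignores indented code blocks and fences inside lists, which is exactly the weakness this thread discusses.

```rust
/// Keep only lines outside fenced code blocks, toggling a simple
/// `inside_block` flag on every "```" fence line. Indented code blocks
/// and fences nested in lists are NOT handled by this sketch.
fn lines_outside_code_blocks(document: &str) -> Vec<&str> {
    let mut inside_block = false;
    document
        .lines()
        .filter(|line| {
            if line.trim_start().starts_with("```") {
                inside_block = !inside_block; // fence line: toggle and drop
                false
            } else {
                !inside_block
            }
        })
        .collect()
}
```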

dvtomas avatar Jun 04 '21 06:06 dvtomas

@dvtomas lists also use indenting, and code within lists, and lists in lists, may use more indents ... if you could get away with this approach, nobody would use the more complex mechanisms like a lexer. :/ Though if your solution works for you in your case, it is good of course!

My use case is pretty much the opposite: I need the tool to be very robust, as I want it to be used in CI jobs for all Open Source Hardware documentation projects (which are available as a git repo). Optimally, thousands. They will come up with every possible way to challenge this tool, I bet. :-)

hoijui avatar Jun 04 '21 08:06 hoijui

Sure, I'm all for a robust solution and looking forward it. It's just that you could devise a simple solution in like a day, and have an already useful utility covering an overwhelming majority of use-cases. If some broken link inside a deeply nested list isn't reported, that's of course bad, but not a tragedy. You could work towards the perfect solution meanwhile.

Btw, have you considered adapting (perhaps improving) some existing parser, e.g. https://crates.io/crates/markdown ? Sorry if this was already discussed, I haven't checked the whole discussion.

dvtomas avatar Jun 04 '21 08:06 dvtomas

... ahhhh, good idea! https://github.com/johannhof/markdown.rs/blob/master/src/parser/mod.rs#L78 this method takes a string (an MD document) and returns a tree of parsed elements. That should work; and yes, in case it does not support custom header anchors, we could add it. Looking at the project issues though, it looks to be quite buggy still, and it does not mention following any markdown flavor, which means.. it would be quite some work too. Is there another library? It looks like not, right?

hoijui avatar Jun 04 '21 09:06 hoijui

If we decide that this library is too buggy, I think it would make sense to make our own, separate from mlc, which just parses MD into an AST and nothing more. ... and make sure the AST is versioned, or better: is annotated with arbitrary tags, so as not to fall into the same trap as pandoc.

hoijui avatar Jun 04 '21 09:06 hoijui

Wish you good luck with that :-) https://crates.io/search?page=2&q=markdown%20parser reveals a lot of abandoned projects, so yes, I guess a robust markdown parser would be a good thing to have. Of course it is hard to decide whether it is easier to fix the markdown crate or to create a new one from scratch; perhaps the crate author could help you decide.

dvtomas avatar Jun 04 '21 10:06 dvtomas