Parse GFM Extended Autolinks

Open dillonkearns opened this issue 5 years ago • 8 comments

We currently handle the CommonMark autolinks, which are links with explicit surrounding <>'s.

However, we are not parsing the GitHub-Flavored Markdown's extended autolinks, which are bare links with no explicit token. The fact that it should be parsed as a link is inferred by the format, for example content starting with https:// and followed by a valid domain.

You can see the current end-to-end spec failures here:

https://github.com/dillonkearns/elm-markdown/blob/master/test-results/failing/GFM/%5Bextension%5D%20Autolinks.md

This issue will be complete when we've made those end-to-end tests pass.

Existing Inline Parsing Code

Note that the inline parsing code does not use elm/parser, because Markdown inline parsing uses a very different algorithm than block parsing, one that is not well suited to elm/parser. The details of why are not important for this issue, but it's worth being aware that this code is based on Regex processing.

Here's the current area where CommonMark-style autolinks are handled:

https://github.com/dillonkearns/elm-markdown/blob/40f9dc4d1e8cfd9d27bc2be97447adf752ef052c/src/Markdown/InlineParser.elm#L1101-L1107

Note that it is only applying this in the context of angleBracketsToMatch. We can likely reuse some of the autolinkToMatch code, but outside of the context of an angle brackets match.

dillonkearns avatar Oct 02 '20 18:10 dillonkearns

@dillonkearns I'll work on this one if it's up for grabs!

stephenreddek avatar Oct 20 '20 20:10 stephenreddek

It's all yours, thank you @stephenreddek! 👌 💯

dillonkearns avatar Oct 20 '20 20:10 dillonkearns

@dillonkearns What are your thoughts on how to handle multiple trailing "entity references" per https://github.github.com/gfm/#example-626 ? It only explicitly mentions handling a single, trailing reference, but it sure feels like it should remove multiple of them if they exist.

stephenreddek avatar Oct 21 '20 15:10 stephenreddek

One piece of supporting evidence for the idea of trimming all: the parentheses rule removes all trailing unmatched parentheses.
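For reference, here is a rough sketch of the parentheses rule as I read it in the GFM spec, written in Python only for illustration (the real change would live in the Elm inline parser). The function name `strip_unmatched_closing_parens` is hypothetical, and this assumes the candidate link text has already been extracted.

```python
def strip_unmatched_closing_parens(link):
    """Split off trailing ')' characters that have no matching '('.

    Returns (link_without_excess, trailing_text). Illustrative only:
    this is a sketch of the rule, not the elm-markdown implementation.
    """
    # Count parentheses over the whole candidate link.
    excess = link.count(")") - link.count("(")
    trailing = ""
    # Remove closing parens from the end while there are more closers than openers.
    while excess > 0 and link.endswith(")"):
        link = link[:-1]
        trailing += ")"
        excess -= 1
    return link, trailing
```

Note that every unmatched trailing `)` is removed, not just the last one, which is the behavior being compared to the entity reference case.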

stephenreddek avatar Oct 21 '20 15:10 stephenreddek

Another question! The spec for URL autolinks only mentions support for the protocols http and https, but there's a test that also shows support for ftp. It's easy enough to add ftp specifically, but I'm unsure how to handle this seemingly conflicting information. Should it support only those 2 or 3? Should it support anything that looks like a protocol?

Thanks for any guidance you have!

stephenreddek avatar Oct 21 '20 16:10 stephenreddek

Hey @stephenreddek!

Good questions. So for the URL schemes, my thinking is that it should either 1) be very specific (only the explicit ones mentioned, http and https), or 2) be completely general (anything in the form of scheme://).

I don't like the idea of hardcoding a specific set when there are so many possible schemes: https://en.wikipedia.org/wiki/List_of_URI_schemes. And indeed, many different possible valid URLs.

Babelmark tends to treat general schemes, like a slack:// link, as plain text (not autolinks):

https://babelmark.github.io/?text=This+is+a+slack+https+link%3A+https%3A%2F%2Fslack.com%2Fapp_redirect%3Fapp%3DA1BES823B%0A%0AThis+is+a+direct+slack+link%3A+slack%3A%2F%2Fopen%3Fteam%3Dmy-team%0A

So let's go with option (1) on this, and only handle the specific cases of http and https 👍
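To make option (1) concrete, here is a minimal sketch in Python (for illustration only, not the Elm code). The pattern and the name `find_url_autolinks` are assumptions of mine, and the sketch deliberately skips the spec's other rules: valid-domain checks, which characters may precede the link, and trailing punctuation.

```python
import re

# Hypothetical pattern: only the http and https schemes are recognized,
# followed by a run of characters that are neither whitespace nor "<".
URL_AUTOLINK = re.compile(r"https?://[^\s<]+")

def find_url_autolinks(text):
    """Return the bare http/https links found in text (simplified sketch)."""
    return [match.group(0) for match in URL_AUTOLINK.finditer(text)]
```

With this, a `slack://` or `ftp://` link is simply left as plain text, since neither scheme is in the hardcoded pair.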

dillonkearns avatar Oct 28 '20 18:10 dillonkearns

Regarding trailing entity references, it seems right to me that we should remove multiple references. What happens on Babelmark? I often let that be the tie breaker when I'm not sure, with a little extra weight given to the results from the official C implementation of GitHub Flavored Markdown.

dillonkearns avatar Oct 28 '20 18:10 dillonkearns

Yep, the official implementation drops them all so I'll just go with that!
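Roughly, the "drop them all" behavior could look like the following Python sketch (illustration only; the actual work is in the Elm inline parser). The name `strip_trailing_entities` is hypothetical, and the pattern assumes an entity reference looks like `&` followed by one or more alphanumeric characters and a `;`.

```python
import re

# "&" + one or more alphanumerics + ";" anchored at the very end of the string.
TRAILING_ENTITY = re.compile(r"&[A-Za-z0-9]+;\Z")

def strip_trailing_entities(link):
    """Split off every trailing entity-like reference, not just the last one.

    Returns (link_without_entities, trailing_text). Sketch under the
    assumptions stated above, not the official implementation.
    """
    trailing = ""
    while True:
        match = TRAILING_ENTITY.search(link)
        if match is None:
            return link, trailing
        # Keep the removed text so it can be emitted as plain text after the link.
        trailing = match.group(0) + trailing
        link = link[:match.start()]
```

A reference in the middle of the link (for example `&hl=en`) is untouched, because only entity-like text ending in `;` at the end of the link is stripped.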

stephenreddek avatar Oct 30 '20 22:10 stephenreddek