elm-markdown
elm-markdown copied to clipboard
Parse GFM Extended Autolinks
We currently handle the CommonMark autolinks, which are links with explicit surrounding <>'s.
However, we are not parsing the GitHub-Flavored Markdown's extended autolinks, which are bare links with no explicit token. The fact that it should be parsed as a link is inferred by the format, for example content starting with https:// and followed by a valid domain.
You can see the current end-to-end spec failures here:
https://github.com/dillonkearns/elm-markdown/blob/master/test-results/failing/GFM/%5Bextension%5D%20Autolinks.md
This issue will be complete when we've made those end-to-end tests pass.
Existing Inline Parsing Code
Note that the inline parsing code does not using elm/parser because Markdown inline parsing using a very different algorithm than the block parsing, and it's not well-suited to elm/parser. The details of why are not important in this issue, but it's worth being aware that this code is based on Regex processing.
Here's the current area where CommonMark-style autolinks are handled:
https://github.com/dillonkearns/elm-markdown/blob/40f9dc4d1e8cfd9d27bc2be97447adf752ef052c/src/Markdown/InlineParser.elm#L1101-L1107
Note that it is only applying this in the context of angleBracketsToMatch. We can likely reuse some of the autolinkToMatch code, but outside of the context of an angle brackets match.
@dillonkearns I'll work on this one if it's up for grabs!
It's all yours, thank you @stephenreddek! 👌 💯
@dillonkearns What are your thoughts on how to handle multiple trailing "entity references" per https://github.github.com/gfm/#example-626 ? It only explicitly mentions handling a single, trailing reference, but it sure feels like it should remove multiple of them if they exist.
One piece of supporting evidence for the idea of trimming all: the parentheses rule removes all trailing unmatched parentheses.
Another question! The spec for url autolinks only mentions support he protocols http and https, but there's a test that also shows supporting ftp. It's easy enough to add ftp specifically, but I'm unsure how to handle this seemingly conflicting information. Should it support only those 2 or 3? Should it support anything that looks like a protocol?
Thanks for any guidance you have!
Hey @stephenreddek!
Good questions. So for the URL schemes, my thinking is that it should either 1) be very specific (only the explicit ones mentioned, http and https), or 2) be completely general (anything in the form of scheme://).
I don't like the idea of hardcoding a specific set when there are so many possible schemes: https://en.wikipedia.org/wiki/List_of_URI_schemes. And indeed, many different possible valid URLs.
Babelmark tends to treat general schemes, like a slack:// link, as plain text (not autolinks):
https://babelmark.github.io/?text=This+is+a+slack+https+link%3A+https%3A%2F%2Fslack.com%2Fapp_redirect%3Fapp%3DA1BES823B%0A%0AThis+is+a+direct+slack+link%3A+slack%3A%2F%2Fopen%3Fteam%3Dmy-team%0A
So let's go with option (1) on this, and only handle the specific cases of http and https 👍
Regarding trailing entity references, that seems right to me that we should remove multiple references. What happens on babelmark? I often let that by the tie breaker when I'm not sure, with a little extra weight given to the results from the official C implementation for the GitHub Flavored Markdown engine's results.
Yep, the official implementation drops them all so I'll just go with that!