CommonMark.jl
CommonMark.jl copied to clipboard
Keep reference links as references
From what I gather from
https://github.com/MichaelHatherly/CommonMark.jl/blob/287e39ae128e80a6647d2d9762bf22040adffea1/src/parsers/inlines/links.jl#L158-L169
is that link labels get expanded to full links during parsing. Would it be possible / reasonable to somehow have a parser option that would keep the link labels and link reference definitions intact in the AST?
The use case I have in mind is to write a small script to automatically verify and possibly update such reference links and their definitions (specifically in the Documenter CHANGELOG).
Yes, that's pretty much how it works. It's definitely something that should really be fixed since the expectation is that markdown output from the AST is roundtripable, technically true by replacing the reference links with inline links as it currently is, but would be much nicer if reference links remained as is in the output for markdown. I think it was likely a carryover from the python library that this was originally based off of.
Would it be possible / reasonable to somehow have a parser option that would keep the link labels and link reference definitions intact in the AST?
I'd prefer if it was just the only behaviour rather than adding in options. I wouldn't class the change as breaking since I don't think it would be very likely for users to actually be depending on the current behaviour.
I am completely behind having this be the default behaviour. I think it would be more intuitive that way.
I guess for an implementation, there would need to be a new AbstractBlock
for the link reference definitions. And then either Link
would need to be expanded so that it could represent link labels (apparently, there are three different kinds: [foo][bar]
, [foo][]
and [foo]
) or there could be a new AbstractInline
for this.
On a slight tangent: how roundtrippable do you intend it to be? For example, I noticed that lists have some default indentation and use -
, so lists using *
do not get written back out the same way. Would you also imagine that ideally that type of formatting meta information be stored in the AST so that it could be replicated?
I guess for an implementation, there would need to be a new AbstractBlock for the link reference definitions. And then either Link would need to be expanded so that it could represent link labels (apparently, there are three different kinds: [foo][bar], [foo][] and [foo]) or there could be a new AbstractInline for this.
Probably a new block type and a new inline type would be a reasonable way to go so we can avoid manually branching in the show
methods, which may be slightly cleaner implementation-wise. (It'll probably need some iterations to work out the nicest approach.)
On a slight tangent: how roundtrippable do you intend it to be? For example, I noticed that lists have some default indentation and use -, so lists using * do not get written back out the same way. Would you also imagine that ideally that type of formatting meta information be stored in the AST so that it could be replicated?
Roundtrippable in the sense that the markdown->markdown pipeline acts as a strict code formatter without too many (or any user options, gofmt
comes to mind), so indentation gets normalised, math gets written with backticks, lists use a standard char, etc.
I have been looking at this a little bit, and I noticed that a reference link without a valid definition for the link label gets interpreted as a simple string. E.g.
[foo][bar]
would not become a Link
node, because [bar]: ...
is missing.
Now, as far as I can tell, this is completely in line with the spec (it doesn't seem to say it explicitly, but it can be inferred from the examples with problematic link labels; and the spec does say that a reference link needs to have a valid label). But it is a bit of a problem for my use case as I actually want to find such links. My hope is that I could just run the parser and then loop over the AST, finding all the (reference) links.
I can think of two options:
-
Allow for an option to parse reference links even if they do not have a valid labels. This, of course, goes against the spec and introduces an option to the parser, both of which would be good to avoid.
-
Allow the user to hook into the parser to capture the invalid links. I think allowing the user to somehow access the
else
branch of thisif
should be sufficient, e.g. with a callback function.https://github.com/MichaelHatherly/CommonMark.jl/blob/287e39ae128e80a6647d2d9762bf22040adffea1/src/parsers/inlines/links.jl#L161-L164
(2) seems better, as that doesn't touch the parsing behavior. And for my use case, it should also be sufficient: after catching all the invalid links in one parsing round, I could update the .refmap
(not sure if it gets reset though) and then re-run the parser.
Yeah, 2 looks like a better option perhaps, might be nice to have a general mechanism for hooking into different places during parsing for this kind of stuff. Any idea whether there's other MD parsers that allow these kind of "hooks"?
I was thinking about how to properly implement finding these invalid reference links. My original implementation was to add another field to InlineParser
that would contain the fallback, which the user could set with p.inline_parser.reflink_callback = ...
. However, that feels a little ugly.
So instead I was thinking that maybe customizing LinkRule
(and ImageRule
too then I guess) would be a better option. In fact, I think it would even be possible to implement this as an external user-defined rule, by just copying and modifying the functions in links.jl
. However, I am a little uncertain as to how stable the internal parsing logic is, so I am not sure I would want to copy the relatively long parsing code out of CommonMark for this, and so modifying things here might still be preferred.
With all that in mind:
- I am not sure if there's precedent for this, but would it be fine for the rule
struct
s likeLinkRule
to have fields that customize their behaviour? - If yes, then it would be necessary to access the actual rule object in the internal functions. However, I am not sure what would the best way to pass it through the
Rule
calls. Closures for theparser_*
functions ought to work though. I also thought of findingisa LinkRule
fromp.rules
, but theParser
object actually does not get passed (onlyInlineParser
). - I noticed that if you
enable!
the same rule multiple times, it just gets appended top.rules
. Should theenable!
function maybe check if a particular rule is already enabled and no-op then? I am not quite sure what effect it has if you have e.g. twoLinkRule
senable!
d, especially if they do slightly different things.
By the way, I did look through the docs of a few Markdown parsers, but I did not find any that would expose something like this.