commonmark-spec Consider ignoring HTML (inline) in `alt` on image

Consider ignoring HTML (inline) in `alt` on image

Open wooorm opened this issue 2 years ago • 15 comments

CommonMark prescribes that markdown is interpreted, but corresponding tags not output, in alt on <img>:

![foo *bar*]

[foo *bar*]: train.jpg "train & tracks"

<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p>

Example 571

(see also some more info in that section).

To paint a more illustrative picture of this, and introduce the problem:

![a *b* `c` [d](#) <e>f</e> &amp; \- **g**](#)

<p><img src="#" alt="a b c d <e>f</e> &amp; - g" /></p>

I see no good reason that actual HTML is used, while html-from-markdown is ignored. I find that there is something to say for not doing this at all: for a *b* c, maybe the user actually wanted the asterisks in the alt. However, that’s probably too much of a breaking change, and maybe this is fine. And there is something to say for doing it for everything (including actual HTML), that a <em>b</em> c is consistent to a *b* c and turns into a b c. I believe this to be the right call, and hence this issue.

Perhaps of note: HTML does not work in alt. Neither tags, nor comments, nor instructions, nothing. https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state.

I can do the work.

Jun 28 '22 14:06 wooorm

Note: the spec doesn't mandate any particular treatment of the "image description." It only recommends:

Though this spec is concerned with parsing, not rendering, it is recommended that in rendering to HTML, only the plain string content of the image description be used. Note that in the above example, the alt attribute’s value is foo bar, not foo [bar](/url) or foo <a href="/url">bar</a>. Only the plain string content is rendered, without formatting.

So a fully compliant implementation is free to make a different decision.

I see no good reason that actual HTML is used, while html-from-markdown is ignored.

cmark, the reference implementation, renders your example thus:

<p><img src="#" alt="a b c d &lt;e&gt;f&lt;/e&gt; &amp; - g" /></p>

I'm not sure what implementation preserved the <e>? I'd say it's a bug -- certainly an undesirable feature, and something to report to the relevant implementation.

Jul 04 '22 11:07 jgm

I'm not sure what implementation preserved the <e>...

Everyone that follows CommonMark: < and > don’t need to be encoded in double quoted attributes in HTML. They can be, which is what you show, but CM does not require that.

Jul 04 '22 11:07 wooorm

So a fully compliant implementation is free to make a different decision.

Could we improve the recommendation to explain the term “plain string” in a way that represents how several markdown parsers work?

Jul 04 '22 12:07 wooorm

OK, I see your issue is not really about whether the < is encoded, but whether the <e> is included at all.

As I mentioned, the spec is really silent on this -- the examples in this section just embody a recommendation -- so you're free to omit the raw HTML when generating the alt attribute. I agree that that's probably a better thing to do than what cmark currently does -- indeed, my own Haskell commonmark parser omits the tags.

I'd be okay with changes to the spec along these lines, I think, as long as the reference implementations were also updated to match. Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing and need to adjust things for a case that is probably pretty rare.

Jul 04 '22 12:07 jgm

the examples in this section just embody a recommendation

This is coming up several times with several people. One idea to clarify this is to mark test cases as illustrative/optional/recommendations with some attribute.

Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing and need to adjust things for a case that is probably pretty rare.

This to me sounds like the reason for CommonMark to exist and have a set of test cases that people pull in to test their parsers with. Much of markdown is edge cases. It also relates to my previous point on marking test cases as optional.

Jul 04 '22 12:07 wooorm

If you're up for making changes to spec, commonmark.js, and cmark, then you've got the green light!

Jul 04 '22 12:07 jgm

While I understand the question, I am hoping to contribute solely to the spec. I have no C knowledge, so contributing to cmark is out of my reach. I could potentially help with JavaScript, although I am already maintaining several markdown parsers and pressed for time, so I am not really interested in maintaining others as well.

Jul 04 '22 19:07 wooorm

The thing is, if you contribute in this way to the spec, then I have to modify the reference implementations, and that takes time. I don't like them to get out of sync, and I don't have time to spend on this right now. You can keep this issue open if you like.

Jul 05 '22 08:07 jgm

Right, that’s quite fair. Have you have though about opening up maintenance of commonmark? I imagine that I’d feel more inclined to work more on this if I’d have more sense of ownership. Another thought: while one reference parser is crucial, in the JavaScript world there are already several more popular commonmark-compliant markdown parsers. Perhaps it would relieve the burden of maintenance to archive commonmark.js? (cmark could be compiled to wasm for the dingus, or say my own micromark could be used)

Jul 05 '22 08:07 wooorm

Another thought: while one reference parser is crucial, in the JavaScript world there are already several more popular commonmark-compliant markdown parsers. Perhaps it would relieve the burden of maintenance to archive commonmark.js?

As someone who maintains a compliant parser and doesn't count C as a language I understand well, I personally find the JavaScript implementation to be extremely helpful in understanding the impact of spec changes. But I also know the burden of maintaining multiple projects too, so while I'd be bummed to lose the reference parser, it would be understandable.

Jul 25 '22 21:07 colinodell

Are there any updates on this? If not, can we at least get a clarification on what the expected behavior should be?

Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing

The result of updating the spec is a one-time change for some implementors. The result of not updating it is the continued stream of issues regarding the inconsistency between parsers. Latter is arguably more annoying.

Nov 19 '23 18:11 rlidwka

What is the precise change to the spec you think should be made?

Nov 19 '23 20:11 jgm

Just a general overview here. I'll try to suggest precise changes in the next post.

Option 1 - user friendly

Make it so image alt in output HTML is copied verbatim from ![here]:

![*hello* <bar>]() => <img alt="*hello* <bar>">

This seems to be what users expect.

This also has usability issues, because literal ] becomes impossible to add in alt. Perhaps, ![[ hello ] world ]]() syntax can be used to remedy this (similar to backticks), but this is probably a departure too far from existing implementations.

From implementation point of view it's also unclear:

either you store raw input alongside AST for links and images (performance concerns)
or you try to reconstruct initial input using source maps (not all implementations use source maps)
or mandate different handling for links and images (one produces AST, one doesn't), which is quite horrible

Option 2 - keep existing behavior

We now have 3 choices of what to do with ![hello <textarea>]():

<img src="hello <textarea>"> - that's commonmark.js
<img src="hello >textarea<"> - that's cmark
<img src="hello "> - that's haskell

Need to decide which one is correct.

Note that some implementations have an option to disable html rule entirely (at least we do). Having different results based on whether parser is able to parse html might be undesirable.

Nov 20 '23 07:11 rlidwka

In the OP I discussed these choices too, and I advised going with existing behavior, but not emitting html tags. Which is like opt 2 haskell

Nov 20 '23 08:11 wooorm

@jgm, now actually answering your question:

What is the precise change to the spec you think should be made?

Maybe spec should specify how to transform AST into "plain string content" for the purposes of forming image alt:

<text>hello</text> -> hello
<code>hello</code> -> hello
<softbreak /> -> \n
<html_block><a></html_block> -> <a> (same for inline)

The last rule makes sense for me personally (since it leads to the same behavior whether html rule exists or not). But maybe serializing htmls into an empty string is more "correct" in theory.

Also, I'd like to add a test to the spec along these lines:

If you use special symbols in image alt, you can wrap them into code span:
![`*em* <link>`]()

It's not going to fail anywhere (all parsers keep contents of code span as is hopefully), but it may be a useful suggestion for markdown writers (https://github.com/prettier/prettier/issues/15140) on how to deal with special characters.

(or mention in any other way that automated software should escape user content inside image alt when auto-generating markdown)

Nov 20 '23 08:11 rlidwka

commonmark-spec commonmark-spec copied to clipboard

Consider ignoring HTML (inline) in `alt` on image

Option 1 - user friendly

Option 2 - keep existing behavior

commonmark-spec
commonmark-spec copied to clipboard