pulldown-cmark icon indicating copy to clipboard operation
pulldown-cmark copied to clipboard

U+0000 not converted to U+FFFD in all cases

Open ollie27 opened this issue 8 years ago • 6 comments

The CommonMark Spec says:

For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).

However pulldown-cmark doesn't preform this conversion in some cases. Take the following example (with "NUL" replaced with actual null bytes):

# NUL

NUL

�

[NUL](NUL)

`NUL`

```
NUL
```

dingus produces the following:

<h1>�</h1>
<p>�</p>
<p>�</p>
<p><a href="%EF%BF%BD">�</a></p>
<p><code>�</code></p>
<pre><code>�
</code></pre>

pulldown-cmark produces (with "NUL" replaced with actual null bytes):

<h1>�</h1>
<p>�</p>
<p>�</p>
<p><a href="%00">�</a></p>
<p><code>NUL</code></p>
<pre><code>NUL
</code></pre>

The link destination is different and the null byte isn't converted in the code span or block.

ollie27 avatar Aug 06 '17 16:08 ollie27

Hmmm.. the master branch now does replace the nulls inside the code blocks, but doesn't recognize the links any more. Double checking the spec to see if this is intended behavior..

marcusklaas avatar Mar 15 '19 15:03 marcusklaas

I think this remains a TODO. I've personally prioritized it pretty low because nulls in strings are handled smoothly by just about everything in the Rust ecosystem, but in the C things are dicier.

raphlinus avatar Mar 15 '19 16:03 raphlinus

It looks like we break here: https://github.com/raphlinus/pulldown-cmark/blob/master/src/parse.rs#L1892

Do you remember why these first ASCII bytes are treated specially? Can't find any reference to them in the spec.

marcusklaas avatar Mar 15 '19 16:03 marcusklaas

Yes, the spec for link-destination says "a nonempty sequence of characters that does not include ASCII space or control characters", and the code is supposed to reflect that. That's not saying I got this completely right.

raphlinus avatar Mar 15 '19 16:03 raphlinus

Right, thanks. I think we may actually be compliant in that case?

marcusklaas avatar Mar 15 '19 16:03 marcusklaas

My reading is that this is a spec ambiguity. Should the NUL->U+FFFD replacement happen before or after parsing of link declarations? The spec doesn't say. If it's after, then not recognizing the links is correct. But before is what the dingus appears to do.

This is what I would call an extreme edge case. But I think we should file a number of spec clarification issues against CommonMark, and this is definitely one of them.

raphlinus avatar Mar 15 '19 16:03 raphlinus