commonmark-spec Narrow the definition of character to Unicode encoded character

Fixes #791

Mar 18 '25 13:03 tats-u

There is no entry for "Surrogate Code Unit" in https://www.unicode.org/glossary/.

Mar 18 '25 13:03 tats-u

cf. https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md

Mar 18 '25 13:03 tats-u

Another plan: exclude U+FFFD from Unicode punctuation characters in any case (e.g. by excluding the So category).

Mar 23 '25 12:03 tats-u

Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

If an application that uses it accepts files from sources that might not contain valid Unicode (such as files on the filesystem or the Win32 text entry API), the application needs to convert its input to UTF-8. The standard library offers an API to replace lone surrogates with U+FFFD, and it offers an API to return an error.

I always assumed, because the spec says that it doesn't specify an encoding, that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.
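
For illustration, here is a minimal sketch of the two choices described above, assuming the input arrives as UTF-16 code units (as it might from a Win32 API). `String::from_utf16` and `String::from_utf16_lossy` are standard library functions; the wrapper names `decode_strict` and `decode_lossy` are made up for this example and are not part of pulldown-cmark.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: strict decoding.
/// Returns an error if the input contains a lone surrogate.
fn decode_strict(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

/// Hypothetical helper: lossy decoding.
/// Each lone surrogate is replaced with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Either way, the resulting `String` holds only valid UTF-8, so the parser never sees a lone surrogate.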

Apr 25 '25 18:04 notriddle

it offers an API to return an error

I changed it to undefined behavior. With this change, pulldown-cmark is allowed to do anything without notice.

that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.

I had not considered the case where parsers throw errors. I had assumed only the U+FFFD replacement.

Apr 27 '25 12:04 tats-u

Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

I also don't see any advantage of importing encoding matters into the specification. A long time ago I suggested to simply define the standard over sequences of Unicode scalar values (which is the data you get as the result of any valid UTF decode), see #369.

Apr 28 '25 20:04 dbuenzli

I would be fine with that.

Apr 28 '25 23:04 notriddle

Unicode scalar values

It's sufficient for the time being, but I've advocated the CJK-friendly amendments as the fix for #650. There, Unicode noncharacters and reserved code points obstruct the optimization of the CJK ranges. For example, U+3097–U+3098 are reserved and U+2FFFE–U+2FFFF are noncharacters, but both intervals should be treated as CJK to reduce the number of product terms (0xXXXX <= codePoint && codePoint <= 0xYYYY).

Neither Unicode noncharacters nor reserved code points are entered by ordinary users, only by testers.
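
To make the "number of product terms" point concrete, here is a small sketch. The function names are hypothetical, and the ranges are illustrative only: they cover just the Hiragana block and plane 2, not the actual ranges of the CJK-friendly specification.

```rust
/// Illustrative only: treats the whole Hiragana block as one interval,
/// so the reserved code points U+3097..U+3098 are included. One range term.
fn is_hiragana_block(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}')
}

/// Illustrative only: skips the unassigned code points U+3040 and
/// U+3097..U+3098, which takes two range terms instead of one.
fn is_assigned_hiragana(c: char) -> bool {
    matches!(c, '\u{3041}'..='\u{3096}' | '\u{3099}'..='\u{309F}')
}

/// Illustrative only: all of plane 2 as a single interval,
/// including the noncharacters U+2FFFE..U+2FFFF.
fn in_plane_2(c: char) -> bool {
    matches!(c, '\u{20000}'..='\u{2FFFF}')
}
```

The first and third functions accept a few code points that carry no character, which matters only for input that ordinary documents are unlikely to contain.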

Apr 29 '25 12:04 tats-u

There is no such thing as "surrogate characters". Only "surrogate code points" and "surrogate code units" exist. "Unicode scalar value", "Encoded character", and "Assigned character" all exclude surrogate code points from their ranges. Can I add "Fixes #369" to the top description?

Apr 29 '25 13:04 tats-u

I also don't see any advantage of importing encoding matters into the specification.

UTF-8 has invalid overlong encodings. They must be treated the same way as isolated surrogate code units, which occur mainly in UTF-16 (a UTF-8 byte stream can also contain an encoding of surrogate code units, as in CESU-8). We should ignore all of them and leave them to implementations.
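
As a rough illustration of the two kinds of ill-formed UTF-8 mentioned above, the sketch below checks an overlong encoding and a CESU-8-style encoded surrogate with Rust's `std::str::from_utf8`. The helper and constant names are made up for this example.

```rust
/// Overlong (two-byte) encoding of '/' (U+002F); ill-formed in UTF-8.
const OVERLONG_SLASH: [u8; 2] = [0xC0, 0xAF];

/// Three-byte encoding of the surrogate code point U+D800, as CESU-8
/// would produce for half of a surrogate pair; ill-formed in UTF-8.
const ENCODED_SURROGATE: [u8; 3] = [0xED, 0xA0, 0x80];

/// Hypothetical helper: true if `bytes` is well-formed UTF-8
/// according to Rust's standard library.
fn is_well_formed_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Both constants are rejected by this check, so a decoder that validates its input treats them alike before any Markdown parsing starts.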

Apr 29 '25 13:04 tats-u

I'm not sure I fully understand what you are trying to achieve with the definition. But for me a fix to #791 is to simply remove the notion of encoding from the specification.

The idea of #369 is to define the CommonMark grammar over a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF. Such a definition just says: a CommonMark document is defined over any valid Unicode text (and thus what happens on invalid encodings is unspecified by the specification). That way you don't even need to talk about surrogates or ill-formed sequences. The only other thing you need to say is what happens if you input a surrogate code point using an escape; here the text can simply indicate that it must be replaced by the Unicode replacement character U+FFFD.
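
A minimal sketch of that idea, for illustration only. `is_scalar_value` spells out the two ranges, and `resolve_numeric_reference` is a hypothetical name for whatever resolves an escape such as `&#xD800;`. It relies on `char::from_u32`, which returns `None` for surrogates and for values above 0x10FFFF. Other special cases a spec may define (such as U+0000) are deliberately left out.

```rust
/// True if `n` is a Unicode scalar value:
/// 0x0000..=0xD7FF or 0xE000..=0x10FFFF.
fn is_scalar_value(n: u32) -> bool {
    matches!(n, 0x0000..=0xD7FF | 0xE000..=0x10FFFF)
}

/// Hypothetical resolver for a numeric character reference.
/// Anything that is not a scalar value becomes U+FFFD.
fn resolve_numeric_reference(n: u32) -> char {
    char::from_u32(n).unwrap_or(char::REPLACEMENT_CHARACTER)
}
```

In Rust, `char` is itself defined as a Unicode scalar value, so a parser that works on `char`s gets this restriction for free.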

P.S. Excluding reserved code points without mentioning an explicit Unicode version doesn't make much sense. This set of code points is shrinking every year as new characters are added to the standard.

Apr 29 '25 14:04 dbuenzli

a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF to 0xE000..0x10FFFF

There is a convenient term for this: Well-formed Code Unit Sequence.

here the text can simply indicate that it must be replaced by the unicode replacement character U+FFFD.

Implementations should be allowed to emit errors for ill-formed code unit subsequences, too. Of course replacing with U+FFFD is fine.

Apr 29 '25 14:04 tats-u

what happens if you input a surrogate code point using an escape

It should behave the same as the HTML Living Standard, which replaces them with U+FFFD. The spec text about it should be revised too.

Apr 29 '25 14:04 tats-u

There is a convenient term Well-formed Code Unit Sequence.

No, that applies to sequences of code units (8-bit, 16-bit or 32-bit depending on your UTF); you are still talking at the encoding level here. There's no need to. You want to define CommonMark on the output of a UTF decoding process, which is a sequence of scalar values.

Implementations should be allowed to emit errors for ill-formed code unit subsequences, too.

Again, if you simply switch to a sequence of scalar values, you don't have to talk about that. Leave it to the implementer: some UTF decoders fail hard on decode errors, some silently replace them with U+FFFD (according to different strategies), some give the choice. There's no need to talk about that in the CommonMark specification.
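
The layering described here could be sketched as follows. Everything named below (`DecodePolicy`, `decode`, `count_scalar_values`) is hypothetical and is not the API of any existing CommonMark implementation; the parser is reduced to a stand-in that only counts its input.

```rust
use std::str::Utf8Error;

/// Hypothetical decode policy, chosen by the application, not by the parser.
enum DecodePolicy {
    /// Fail hard on ill-formed input.
    Fail,
    /// Replace ill-formed subsequences with U+FFFD.
    Replace,
}

/// Decodes UTF-8 bytes. Whatever the policy, a returned `String`
/// contains only Unicode scalar values.
fn decode(bytes: &[u8], policy: DecodePolicy) -> Result<String, Utf8Error> {
    match policy {
        DecodePolicy::Fail => std::str::from_utf8(bytes).map(String::from),
        DecodePolicy::Replace => Ok(String::from_utf8_lossy(bytes).into_owned()),
    }
}

/// Stand-in for a parser: it only ever sees `char`s (scalar values),
/// so it needs no rules about surrogates or ill-formed sequences.
fn count_scalar_values(text: impl Iterator<Item = char>) -> usize {
    text.count()
}
```

With this split, decode errors are handled entirely before `count_scalar_values` (or a real parser in its place) is called.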

Apr 29 '25 16:04 dbuenzli

Code units themselves are not specific to UTFs. Only Well-Formed / Ill-Formed Code Unit (Sub)sequences are (I had overlooked them). For example:

The same code unit sequence could, of course, be well-formed in the context of some other character encoding standard using 8-bit code units, such as ISO/IEC 8859-1, or vendor code pages.

From: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860

You want to define CommonMark on the output of an UTF decoding process which is: a sequence of scalar values.

I've taken non-Unicode encodings into account, i.e. file content <=> code units (UTF or legacy 8-bit) <=> encoded characters or scalar values. Also, we need to clarify the behavior when decoding fails. In any case, we should make implementers feel secure.
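
As a small illustration of the legacy 8-bit case, the sketch below decodes ISO/IEC 8859-1, where each byte maps to the code point with the same value. The function name `decode_latin1` is made up for this example.

```rust
/// Hypothetical helper: decodes ISO/IEC 8859-1 (Latin-1).
/// Every byte 0x00..=0xFF maps to the code point with the same value,
/// so decoding cannot fail and always yields scalar values.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

For instance, the bytes `63 61 66 E9` are ill-formed as UTF-8 but decode to "café" as ISO/IEC 8859-1.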

This set of code points is shrinking every year as new characters are added to the standard.

This doesn't introduce breaking changes, which should be avoided where possible, for existing Markdown documents written against older Unicode versions. Implementations only have to update their Unicode versions and prepare for the updates.

if you simply switch to a sequence of scalar values, you don't have to talk about that.

Not all strings, character arrays, and files contain only code units that are part of scalar values. We have to prepare for exceptions by explicitly clarifying that "you can do anything".

May 01 '25 10:05 tats-u

I have deferred the "HTML5" → "HTML Living Standard" change to another PR to keep this PR simpler.

https://github.com/commonmark/commonmark-spec/issues/805

Jun 01 '25 14:06 tats-u