Use XML for spec examples

Open jgm opened this issue 10 years ago • 6 comments

We should use a direct XML rendering of the parse tree, of the sort produced by cmark -t xml --normalize, for the spec examples. The HTML we now use mixes two things, parsing and HTML rendering. Since the spec doesn't intend to specify all the details of HTML rendering, this is a bad fit.

See http://talk.commonmark.org/t/use-xml-for-the-spec-examples-and-tests-comments-welcome/994 for discussion.

For compactness, the <?xml instruction, the DOCTYPE, and the outer document node should be left out.

Todo:

[ ] Provide an option in cmark -t xml to produce a fragment, without the doctype and document node.
[ ] Provide a converter that reads HTML and renders this XML output, so people can test their implementations against the spec without writing an XML renderer. Probably the easiest approach is to use the HTML parser in the Python standard library, as we currently do with normalize.
[ ] Use cmark to generate XML output for each of the examples in the spec, and replace the HTML with XML.
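To give a feel for the second item, here is a minimal sketch of such a converter built on the standard library's html.parser. The tag mapping, the class name HtmlToXml and the function html_to_xml are assumptions made for illustration; they are not taken from cmark or from the existing normalize code. The sketch does not reconstruct list tightness and does not address the URL escaping question.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical tag mapping, for illustration only. It covers a few block and
# inline elements and is not taken from cmark or the spec tooling.
TAG_MAP = {"p": "paragraph", "ul": "list", "li": "item",
           "em": "emph", "strong": "strong"}
HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HtmlToXml(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            self.out.append('<header level="%s">' % tag[1])
        elif tag == "ul":
            self.out.append('<list type="bullet">')
        elif tag in TAG_MAP:
            self.out.append("<%s>" % TAG_MAP[tag])

    def handle_endtag(self, tag):
        if tag in HEADERS:
            self.out.append("</header>")
        elif tag in TAG_MAP:
            self.out.append("</%s>" % TAG_MAP[tag])

    def handle_data(self, data):
        # Skip layout whitespace between blocks, but keep U+00A0, which
        # str.strip() with no argument would also remove.
        if data.strip(" \t\r\n"):
            self.out.append("<text>%s</text>" % escape(data))


def html_to_xml(html):
    """Return an XML fragment (no declaration, no document node)."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling html_to_xml on the HTML output of an implementation would give a fragment that can then be compared with the XML in the spec. Tags outside the mapping are silently dropped here, which a real converter would have to handle properly.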

Possible issue with HTML -> XML conversion: how do we deal with escaped URLs? It's not clear that unescaping them is the right thing to do. This interacts with #270 and http://talk.commonmark.org/t/need-clarification-for-links-escaping-unescaping/998.

jgm avatar Jan 07 '15 19:01 jgm

Specifying the "result" of parsing and interpreting a CommonMark input text not in the form of an output HTML text is certainly a good idea. The specification should instead describe some kind of CommonMark "content model", and how an instance of this model (the abstract "result" of processing CommonMark) is derived from the input text.

"The" obvious way to define a "content model" is by way of a document type definition, like the current CommonMark.dtd does. (Despite some issues that I still have with this concrete DTD, as you might know ...)

The specification could then present XML fragments which would result from the various CommonMark constructs, where the complete result is understood to be obtained from substituting/concatenating these fragments. If I understand correctly, this is what you propose.

There are however subtle problems to consider in this approach: an XML text is a mixture of content (character data) and markup itself, and is not by itself a unique representation of the document content (as different XML texts can and do represent the same abstract content).

This is mostly a technical or notational problem, and there are several options to cope with it:

  1. Specify the result as canonical XML text (or fragments thereof), thus forcing a one-to-one relation between XML text and document content.
  2. Do not use XML but a better suited notation to represent the document content. I suggest considering:
    • the JSON notation, using the convention given in the MicroXML specification;
    • the RAST syntax, defined in ISO 13673.

I would argue that both JSON and RAST are better suited as a notation to specify the CommonMark parsing result (aka "the AST"), for the simple reason that they more clearly and cleanly separate content from markup.

And while "everybody" knows JSON, the RAST syntax might seem pretty obscure in comparison. It still has several advantages over JSON in our context:

  1. It was explicitly designed as a syntax to represent the parsing result (output from an SGML parser) of a "structured document", in other words
  2. it was explicitly designed to represent the "parsed document content" (aka element structure information set (ESIS) respectively XML information set),
  3. it was explicitly designed to be "human-readable" and unambiguous (therefore using only printable ISO 646 IRV aka US-ASCII characters),
  4. it was designed so that two documents map to the same RAST data iff the two documents are "equivalent", ie have the same "parsed content",
  5. implementing RAST output in a CommonMark processor is trivially simple (been there, done that),
  6. because the freely available SGML (and XML) parsers SP and OpenSP can produce RAST output, there is an obvious and simple way to test that eg the HTML or XHTML output of a CommonMark processor matches the "abstract" CommonMark content model.

To compare the properties of each option, here is what the "transformed" result of the following short CommonMark text would look like:

Lorem ipsum
-----------

Dolor sit amet, consectetur adipiscing elit:

  - sed&#160;do eiusmod tempor incididunt ut 
  - labore et dolore magna aliqua.

Note the U+00A0 NO-BREAK SPACE inserted between sed and do.

Canonical XML

Output from cmark -t xml, passed through xmlwf to "canonicalize" it:

<header level="2">&#10;    <text>Lorem ipsum</text>&#10;  </header>&#10;  <paragraph>&#10;    <text>Dolor sit amet, consectetur adipiscing elit:</text>&#10;  </paragraph>&#10;  <list tight="true" type="bullet">&#10;    <item>&#10;      <paragraph>&#10;        <text>sed</text>&#10;        <text> </text>&#10;        <text>do eiusmod tempor incididunt ut</text>&#10;      </paragraph>&#10;    </item>&#10;    <item>&#10;      <paragraph>&#10;        <text>labore et dolore magna aliqua.</text>&#10;      </paragraph>&#10;    </item>&#10;  </list>

Note that the numeric character reference for the NBSP has been replaced by an actual U+00A0 character: this is required by both the CommonMark specification and the Canonical XML specification. It can be "found" between <text> and </text> after the word sed ...

With "artifact" white space (ie white space which is not part of the document content) removed (it is my understanding that a Canonical XML document may only contain white space which is part of the document's content, ie Information Set):

<header level="2"><text>Lorem ipsum</text></header><paragraph><text>Dolor sit amet, consectetur adipiscing elit:</text></paragraph><list tight="true" type="bullet"><item><paragraph><text>sed</text><text> </text><text>do eiusmod tempor incididunt ut</text></paragraph></item><item><paragraph><text>labore et dolore magna aliqua.</text></paragraph></item></list>

Not really an improvement regarding "readability", but the NBSP kind-of sticks out more here.

JSON

Using the MicroXML data model: every element in the document content is represented by a JSON array containing three items:

  1. The element's GI as a string,
  2. the element's attributes as a JSON "object", ie a map from attribute name (strings) to attribute value (strings),
  3. the element's (parsed) content: a JSON array containing
    • strings containing character data, and
    • three-element arrays for child elements.

(Hand-edited JSON text:)

[ "header" , { "level" : "2" } , [
  [ "text" , {} , [ "Lorem ipsum" ]]]],
[ "paragraph" , {} , [
  [ "text" , {} , [ "Dolor sit amet, consectetur adipiscing elit:" ]]]],
[ "list" , { "type" : "bullet", "tight" : "true" } , [
  [ "item" , {} , [
    [ "paragraph" , {} , [
      [ "text" , {} , [ "sed" ]],
      [ "text" , {} , [ "\u00A0" ]],
      [ "text" , {} , [ "do eiusmod tempor incididunt ut" ]]]]]],
  [ "item" , {} , [
    [ "paragraph", {} , [
      [ "text" , {} , [ "labore et dolore magna aliqua." ]]]]]]]]

Aside from the bracketing orgy that only a Scheme programmer can love, this is IMO far more readable and malleable than the Canonical XML representation.

The NBSP is represented here by a JSON "Unicode escape" sequence; it could as well have occurred literally inside the string. Note that this means that without introducing further conventions, the JSON representation is ambiguous.

Because the character data is all inside JSON string literals, one can reformat the JSON text independently.
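A small sketch of what this means in practice, using Python's standard json module: two JSON texts that differ in layout and in how the NBSP is written (escape sequence vs. literal character) parse to the same value. The helper name same_tree is made up for this illustration.

```python
import json


def same_tree(json_a, json_b):
    """Compare two JSON texts by parsed value, ignoring layout and escapes."""
    return json.loads(json_a) == json.loads(json_b)


# NBSP written as a JSON escape sequence, no extra white space.
compact = '["text",{},["\\u00A0"]]'
# NBSP as a literal character, with the text reformatted over several lines.
spaced = '[ "text" , {} ,\n  [ "\u00a0" ]\n]'

print(same_tree(compact, spaced))
```

So the ambiguity exists at the level of the JSON text only; comparing the parsed values makes it disappear without any further convention.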

An obvious advantage of using JSON would be that "parsers" for JSON exist everywhere, thus tools to check, analyze, compare, report test case output would be really easy to build.
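As an illustration of how little tooling this takes, here is a sketch that derives the three-item arrays described above from an XML element, using only xml.etree.ElementTree from the Python standard library. The function name to_triple is an assumption for this example, and the input is taken to be well-formed XML without layout white space.

```python
import json
import xml.etree.ElementTree as ET


def to_triple(elem):
    """Return [GI, attributes, content] for an ElementTree element."""
    content = []
    if elem.text:
        content.append(elem.text)
    for child in elem:
        content.append(to_triple(child))
        # Character data following a child belongs to the parent's content.
        if child.tail:
            content.append(child.tail)
    return [elem.tag, dict(elem.attrib), content]


header = ET.fromstring('<header level="2"><text>Lorem ipsum</text></header>')
print(json.dumps(to_triple(header)))
```

Feeding the result to json.dumps gives a JSON text in the convention used above, up to layout.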

RAST

(Output from my cm2doc CommonMark processor, with renamed GIs to match CommonMark.dtd):

[header
level=
!2!
]
[text]
|Lorem ipsum|
[/text]
[/header]
[paragraph]
[text]
|Dolor sit amet, consectetur adipiscing elit:|
[/text]
[/paragraph]
[list
tight=
!true!
type=
!bullet!
]
[item]
[paragraph]
[text]
|sed|
[/text]
[text]
#160
[/text]
[text]
|do eiusmod tempor incididunt ut|
[/text]
[/paragraph]
[/item]
[item]
[paragraph]
[text]
|labore et dolore magna aliqua.|
[/text]
[/paragraph]
[/item]
[/list]

This is basically a line-oriented format. The element start and end tag syntax should be obvious. Character data is written enclosed in | on separate lines (like the title Lorem ipsum), but "special characters" outside the ISO 646 IRV G0 set are written as a Unicode code point (like #160 for U+00A0 NBSP).

For the "elements with attributes and character data" model this is all the RAST syntax one needs; representing the processing instructions that CommonMark recognizes is equally simple.
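To back up the "trivially simple" claim, here is a rough sketch of an emitter that mimics the layout shown above. It is not a full ISO 13673 implementation: it follows only what is visible in the sample, sorting attributes by name is an assumption that happens to match the sample, every character outside printable ASCII (including control characters) is written as a # code point line, and a literal | inside character data is not treated specially. The names rast_lines and data_lines are made up for the example.

```python
import xml.etree.ElementTree as ET


def data_lines(text):
    """Split character data into |...| runs and #N code point lines."""
    lines, run = [], ""
    for ch in text or "":
        if " " <= ch <= "~":          # printable US-ASCII
            run += ch
        else:
            if run:
                lines.append("|" + run + "|")
                run = ""
            lines.append("#%d" % ord(ch))
    if run:
        lines.append("|" + run + "|")
    return lines


def rast_lines(elem):
    """Return the RAST-like lines for an element and its content."""
    lines = []
    if elem.attrib:
        lines.append("[" + elem.tag)
        for name in sorted(elem.attrib):   # assumption: sort by name
            lines.append(name + "=")
            lines.append("!" + elem.attrib[name] + "!")
        lines.append("]")
    else:
        lines.append("[" + elem.tag + "]")
    lines.extend(data_lines(elem.text))
    for child in elem:
        lines.extend(rast_lines(child))
        lines.extend(data_lines(child.tail))
    lines.append("[/" + elem.tag + "]")
    return lines


para = ET.fromstring("<paragraph><text>sed</text><text>\u00a0</text></paragraph>")
print("\n".join(rast_lines(para)))
```

Because the output is a list of lines, comparing two documents comes down to comparing two lists.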

tin-pot avatar Dec 23 '15 02:12 tin-pot

I would say that

  <header level="2">
    <text>Lorem ipsum</text>
  </header>
  <paragraph>
    <text>Dolor sit amet, consectetur adipiscing elit:</text>
  </paragraph>
  <list type="bullet" tight="true">
    <item>
      <paragraph>
        <text>sed</text>
        <text> </text>
        <text>do eiusmod tempor incididunt ut</text>
      </paragraph>
    </item>
    <item>
      <paragraph>
        <text>labore et dolore magna aliqua.</text>
      </paragraph>
    </item>
  </list>

is much more readable than the JSON

[ "header" , { "level" : "2" } , [
  [ "text" , {} , [ "Lorem ipsum" ]]]],
[ "paragraph" , {} , [
  [ "text" , {} , [ "Dolor sit amet, consectetur adipiscing elit:" ]]]],
[ "list" , { "type" : "bullet", "tight" : "true" } , [
  [ "item" , {} , [
    [ "paragraph" , {} , [
      [ "text" , {} , [ "sed" ]],
      [ "text" , {} , [ "\u00A0" ]],
      [ "text" , {} , [ "do eiusmod tempor incididunt ut" ]]]]]],
  [ "item" , {} , [
    [ "paragraph", {} , [
      [ "text" , {} , [ "labore et dolore magna aliqua." ]]]]]]]]

It's easier to see how the nesting works with the XML, and the JSON's quotation marks and brackets are distracting. (And both are more readable than the RAST, where you have to count to determine nesting.) Now of course if you remove all the spaces from the XML, it isn't very readable. But the same is true of the JSON.

Of course you're right that two (semantically, not just syntactically) different XML texts can correspond to the same CommonMark AST, because of the whitespace around elements. But we could always say that you measure conformance by seeing if the canonicalized XML matches the canonicalized version of the XML in the spec. There's no need to present it in canonicalized (unreadable) form in the spec. Readability is important in the spec.
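A minimal sketch of such a comparison, assuming Python 3.8+ for xml.etree.ElementTree.canonicalize. The rule for what counts as layout white space is an assumption made here: white-space-only text is dropped only inside elements that have child elements, so leaf content such as a lone NBSP or space inside <text> survives. The function names strip_layout and same_ast are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"   # deliberately excludes U+00A0


def strip_layout(elem):
    """Drop white-space-only text around child elements, in place."""
    if len(elem):  # only elements with children can hold layout white space
        if elem.text and not elem.text.strip(XML_WS):
            elem.text = None
        for child in elem:
            strip_layout(child)
            if child.tail and not child.tail.strip(XML_WS):
                child.tail = None
    return elem


def canonical(fragment):
    # Fragments may have several top-level elements, so wrap them.
    root = strip_layout(ET.fromstring("<document>%s</document>" % fragment))
    return ET.canonicalize(ET.tostring(root, encoding="unicode"))


def same_ast(fragment_a, fragment_b):
    return canonical(fragment_a) == canonical(fragment_b)


pretty = ('<list type="bullet" tight="true">\n  <item>\n'
          '    <text>\u00a0</text>\n  </item>\n</list>')
compact = ('<list tight="true" type="bullet"><item>'
           '<text>\u00a0</text></item></list>')
print(same_ast(pretty, compact))
```

Canonicalization also normalizes attribute order, so the spec could keep the indented form while tests compare the canonical one.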

Another possible move would be to change things so that text only occurs in attributes:

<text content="Hi there"/>

instead of

<text>Hi there</text>

With this small change, the canonicalization procedure would be simple: remove all text nodes.
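Sketched in Python, again assuming Python 3.8+ for xml.etree.ElementTree.canonicalize, and with a hypothetical content attribute as in the example above, the procedure could look like this. The function name drop_text_nodes is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def drop_text_nodes(fragment):
    """Canonical form of a fragment whose text lives only in attributes."""
    root = ET.fromstring("<document>%s</document>" % fragment)
    for elem in root.iter():
        # With all content in attributes, every text node is layout.
        elem.text = None
        elem.tail = None
    return ET.canonicalize(ET.tostring(root, encoding="unicode"))


indented = '<paragraph>\n  <text content="Hi there"/>\n</paragraph>'
flat = '<paragraph><text content="Hi there"/></paragraph>'
print(drop_text_nodes(indented) == drop_text_nodes(flat))
```

No rule about which white space is significant is needed in this variant, since none of it is.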

I think that most likely we'll stay with HTML for the spec examples for the near future. If we used XML (or JSON or RAST) we'd need to either (a) require that conforming implementations provide an XML/JSON/RAST renderer or (b) provide some program that converts the HTML output of the tested implementation into CommonMark XML. (a) is an onerous requirement on implementers, and (b) introduces a lot of extra complexity into the test procedure.

jgm avatar Dec 23 '15 04:12 jgm

Would it be possible for the spec to indicate both the expected AST and the reference rendering of the expected HTML? Implementations could then target either AST compliance or exact HTML compliance. That way, existing implementations can continue to target HTML compliance while only the implementations that need to target AST compliance need to worry about providing an AST renderer.

(By the way, my instinct would be to provide an AST parser, not a renderer. That way, I could parse the Markdown input into whatever internal AST my library needs, parse the expected AST into an internal AST, then compare the two ASTs, without rendering either of them.)

CPColin avatar Mar 24 '17 17:03 CPColin

+++ Colin Bartolome [Mar 24 17 10:33 ]:

> Would it be possible for the spec to indicate both the expected AST and the reference rendering of the expected HTML?

In principle, yes, but then it becomes hard to fit side-by-side columns in a reasonable width.

jgm avatar Mar 30 '17 07:03 jgm

Could the translation from TXT to HTML leave out the AST? Or maybe give each bit of expected AST and HTML certain CSS classes, hide the AST, then add a button that toggles between them?

CPColin avatar Mar 30 '17 15:03 CPColin

Couldn't you include the HTML, for the benefit of coders and readers, after the line with the "section" information and before the closing validation line?

Annastaciahubbard avatar Nov 23 '19 06:11 Annastaciahubbard