commonmark-spec icon indicating copy to clipboard operation
commonmark-spec copied to clipboard

[Link Reference Definition] Block starts in Link Title that spans multiple lines break the definition

Open movsb opened this issue 4 years ago • 6 comments

What the spec says:

A link title consists of either

  • a sequence of zero or more characters between straight double-quote characters ("), including a " character only if it is backslash-escaped, or

  • a sequence of zero or more characters between straight single-quote characters ('), including a ' character only if it is backslash-escaped, or

  • a sequence of zero or more characters between matching parentheses ((...)), including a ( or ) character only if it is backslash-escaped.

Although link titles may span multiple lines, they may not contain a blank line.

But if block starts like #, >, * start a new line of link title. The Link will be broken.

I see that the reference CommonMark javascript implementation first parses link reference definition lines into a paragraph, then tries to parse it as a link reference definition. But block starts like #,* and > as well as blank lines break Paragraphs. So, the link reference definitions are also broken.

    paragraph: {
        continue: function(parser) {
            return (parser.blank ? 1 : 0);
        },
        finalize: function(parser, block) {
            var pos;
            var hasReferenceDefs = false;

            // try parsing the beginning as link reference definitions:
            while (peek(block._string_content, 0) === C_OPEN_BRACKET &&
                   (pos =
                    parser.inlineParser.parseReference(block._string_content,
                                                       parser.refmap))) {
                block._string_content = block._string_content.slice(pos);
                hasReferenceDefs = true;
            }
            if (hasReferenceDefs && isBlank(block._string_content)) {
                block.unlink();
            }
        },
        canContain: function() { return false; },
        acceptsLines: true
    }

Example link reference definition (dingus link):

[foo]: /url "
first line
# this should be a heading
second line
"

Just think that #, >,*, especially blank lines are common in Titles, for example, a commit message in commonmark.js:

dingus: add dependency version requirements (#159)

* dingus: add dependency version requirements

Dingus was rendering incorrectly with Bootstrap 4. Added a bower.json
which requires Bootstrap, jQuery and Lodash with major version equal
to what's currently live. Likewise the minimum patch version.

* dingus: require lodash ^4.17.11.

what GitHub rendered:

<a data-pjax="true" title="dingus: add dependency version requirements (#159)

* dingus: add dependency version requirements

Dingus was rendering incorrectly with Bootstrap 4. Added a bower.json
which requires Bootstrap, jQuery and Lodash with major version equal
to what's currently live. Likewise the minimum patch version.

* dingus: require lodash ^4.17.11." class="link-gray" href="/commonmark/commonmark.js/commit/a7eb3af149a7e22526a3eb40216504d2ecdd0c8a">dingus: add dependency version requirements (</a>

Adding backslashs seems to work. but the spec doesn't statement this.

If I try to use backslash to escape a new line, the new line preserves (not be a space), so does the backslash. (dingus link)

[foo]: /url "
first line
\# not a heading
\
second line
"

[foo]

movsb avatar Nov 01 '19 14:11 movsb

But see https://spec.commonmark.org/0.29/#precedence which says that indicators of block structure always take precedence over indicators of inline structure.

jgm avatar Nov 02 '19 14:11 jgm

Related issue: #605.

I think the underlying issue here is that the reference implementations parse these as paragraphs, for parsing efficiency, but they aren't actually spec'd as paragraphs. So the spec may need some tightening up in this respect.

jgm avatar Nov 02 '19 14:11 jgm

@jgm Thanks for replying.

that indicators of block structure always take precedence over indicators of inline structure.

In the Spec, the Link Reference Definition is under the Leaf blocks section, so they have the same precedence?

First I find it hard to use a state machine to parse definitions, so I look into how commonmark.js implements this. Now I know it treats Link Reference Definition as an only inline of a paragraph.

I think using backslashes may be a good solution as long as we carefully process the backslash with new line?

movsb avatar Nov 02 '19 15:11 movsb

@movsb I am currently writing a state machine based spec and implementation on how to parse CM, this is what lead me to this issue as well. The way I do definitions and paragraphs is first parse "content" (this is both link definitions and paragraphs). When content is closed, it is first parsed for zero or more definitions, the rest is phrasing. If there is phrasing, and the content was closed because of a setext heading underline, the phrasing becomes part of a heading. If there was no phrasing, the setext heading underline itself becomes the start of a new content. Otherwise, if the content was closed normally, then the phrasing becomes part of a paragraph. Furthermore, I use four state machines: one for flow constructs ("block"), one for content, one for rich text ("inline"), and one for plain text (text with escapes and character references)

wooorm avatar Nov 02 '19 15:11 wooorm

@wooorm Oh, look forward to seeing that.

A question: what conditions will make a content closed? A blank line?

My state machine is only for definitions parsing. It reports a state whether blank lines can be continue lines.

for each line of lines:
    switch state:
        case label:
        case url:
        case title:
        case title_span_lines:
              return need_more_line
  • If the state machine is not yet finished, every single line is sent to it.
  • If the state machine fails, every line it consumed are reconstructed into a paragraph.

It's a WIP, don't know if it works.

BTW, Parsing blank lines is the most tricky thing in my implementation.

movsb avatar Nov 02 '19 16:11 movsb

This conversation is a bit off topic, but thematic breaks, atx headings, setext heading underlines, fenced code blocks, HTML blocks (except type 7), blank lines, block quotes, and lists/list-items all close content. Indented code does not.

I personally don’t see blank lines as difficult, so I’m unsure what you mean calling them tricky?

wooorm avatar Nov 02 '19 23:11 wooorm