commonmark-spec

Precedence rules of syntax markers (inlines and blocks)

Open mity opened this issue 9 years ago • 2 comments

The precedence rules for inline markers are spread across the whole of part 6 of the specification.

E.g. in chapter 6.4 on emphasis and strong emphasis:

Inline code spans, links, images, and HTML tags group more tightly than emphasis.

Or in chapter 6.5 on links:

The brackets in link text bind more tightly than markers for emphasis and strong emphasis. Thus, for example, *[foo*](url) is a link.

But when skimming the specification, it is easy to miss such notes.

Worse, when implementing features chapter by chapter, you may only run into such a note in 6.4, after putting quite a lot of work into 6.1, 6.2 and 6.3, and only then learn that it all has to be done differently.

I suggest there should be an extra chapter just about the precedence of inline marks to highlight this more, and to make the precedence rules as simple as possible to understand, without the need to piece them together from little notes scattered all over the specification.
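To illustrate why this ordering matters for an implementation, here is a minimal Python sketch (not taken from any real parser) that resolves links first and only then looks for emphasis inside each remaining segment. The names `render_inline` and `render_emphasis` are hypothetical, the regexes cover only the simplest `[text](url)` and `*text*` forms, and emphasis that wraps a whole link is deliberately not handled.

```python
import re

# Simplified patterns (assumptions): inline links without titles,
# and single-asterisk emphasis with no nesting.
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]*)\)")
EMPH = re.compile(r"\*([^*\s](?:[^*]*[^*\s])?)\*")


def render_emphasis(text: str) -> str:
    """Render *emphasis* inside a segment that contains no links."""
    return EMPH.sub(r"<em>\1</em>", text)


def render_inline(text: str) -> str:
    """Resolve links first, then emphasis within each segment.

    Because link brackets bind more tightly, an asterisk outside the
    link text can never pair with one inside it.
    """
    out = []
    pos = 0
    for m in LINK.finditer(text):
        out.append(render_emphasis(text[pos:m.start()]))
        label = render_emphasis(m.group(1))
        out.append(f'<a href="{m.group(2)}">{label}</a>')
        pos = m.end()
    out.append(render_emphasis(text[pos:]))
    return "".join(out)


print(render_inline("*[foo*](url)"))  # *<a href="url">foo*</a>
```

An implementation that had matched emphasis first would have paired the two asterisks and broken the link, which is exactly the kind of rework described above.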

mity, Nov 25 '16 14:11

Would love to see a formal section on precedence rules for everything; even the block interruption rules are quite spread out.

E.g. with lists, the precedence rules that an empty list item may not interrupt a paragraph, and that a list item starting with a number other than 1 may not interrupt a paragraph, are spread apart by twenty examples.
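Put side by side, the two rules fit into one small predicate. The following Python sketch is only an illustration: the function name `list_item_can_interrupt_paragraph` is hypothetical, the regexes are simplified, and it assumes thematic breaks (e.g. `* * *`) and Setext underlines have already been ruled out by earlier checks.

```python
import re

# Simplified list-marker patterns (assumptions). The optional group
# captures the item content; it is None for an empty list item.
BULLET = re.compile(r"^ {0,3}[-+*](?:[ \t]+(\S.*))?$")
ORDERED = re.compile(r"^ {0,3}(\d{1,9})[.)](?:[ \t]+(\S.*))?$")


def list_item_can_interrupt_paragraph(line: str) -> bool:
    """Return True if `line` may start a list right after paragraph text."""
    line = line.rstrip()
    m = BULLET.match(line)
    if m:
        # Rule 1: an empty list item may not interrupt a paragraph.
        return m.group(1) is not None
    m = ORDERED.match(line)
    if m:
        # Rule 1 again, plus rule 2: the start number must be 1.
        return m.group(2) is not None and int(m.group(1)) == 1
    return False


print(list_item_can_interrupt_paragraph("- foo"))   # True
print(list_item_can_interrupt_paragraph("-"))       # False
print(list_item_can_interrupt_paragraph("2. foo"))  # False
```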

aidantwoods, May 22 '17 00:05

For blocks, the info is also quite scattered, but imho it is much easier to put into words than for the inlines. Probably because different inlines are analyzed quite differently (consider forward scanning for some closer mark versus stacks of opener and closer marks, nesting rules for links/images, etc.).

Or maybe it is just my current impression, because I needed to review this part of the code, so it's fresh in my head ;-).

If I take a look at the MD4C source (function md_analyze_line()), the precedence list for blocks might look as follows (extensions to CommonMark are omitted here).

In some cases, I have put multiple things into a single step below. In these cases the precedence should not matter.

  1. Container (block quote and list item) continuation line, i.e. whether the line belongs to a "parent" container opened on some previous line.

    I.e. analyze > and line indentation, as deeply as (but no more deeply than) the stack of currently open containers.

    Rationale: This must come first because anything can be nested in lists, in block quotes, or in some combination of both (as both can also be nested in each other).

  2. One of:

    • Fenced code block continuation line.
    • HTML block continuation line.

    Rationale: Once we know we are inside one of these two blocks, it continues no matter what, unless its specific end condition is met.

  3. Blank line.

    Rationale: A blank line has a block-interrupting effect on more or less anything below (and nothing above).

  4. One of:

    • Setext underline (if it follows a single line of a normal paragraph).
    • Thematic break line.

    Note for Setext: If the line is a Setext underline, change the type of the previous line from paragraph to heading.

    Rationale: Setext underline comes before Step (5) because it must share the container block nesting with the preceding line. Thematic break line (* * *) has precedence over nested list items.

  5. "Brother" list item.

    I.e. whether the line is just another list item in some already started list.

    If yes, close the previous list item, open a new one, and go back to Step (3) (after consuming the processed list item mark, so that only the rest of the line gets analyzed).

    Rationale: This comes after (4) because * * * (if not followed by anything) is a thematic break rather than a nested list item.

  6. Indented code.

    Rationale: Must come after (5) because list item or block quote markers can be indented quite a lot if nested deeply in a hierarchy of lists.

  7. "Child" container.

    I.e. whether there is a list item or block quote marker opening a new (nested) container.

    If yes, open the new container (nested in the parent and/or brother containers detected in the previous steps), eat the marker, and go back to Step (3).

  8. One of:

    • ATX header.
    • Fenced code block (initial line, i.e. the fence).
    • HTML block (initial line, i.e. the starting condition).

  9. Lazy continuation line.

    Actually this is not really a block type, but a special rule that logically joins a paragraph inside a container block with the following text into a single paragraph, even if the analyzed line is not indented enough to be part of the container (as required in the 1st step).

    Rationale: This rule has very low precedence so that the other block types above can follow a preceding container block without a delimiting blank line.

  10. Paragraph.

    Rationale: This comes last because a paragraph is anything that is not recognized as one of the block types above.

  11. Link reference definitions (LRD).

    These are not easily handled during line analysis because they are too complex and can span multiple lines, so MD4C defers them until it knows complete paragraphs.

    Imho, the simplest way is to test paragraphs (once all paragraph lines are known) for whether they start with line(s) that can be seen as an LRD. If yes, handle them and rip those line(s) from the paragraph text. Repeat until the initial line(s) do not form a valid LRD or until the paragraph is empty. If empty, discard the paragraph completely (i.e. make sure not to emit <p></p>). Cmark works the same here, AFAIK.
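The paragraph post-processing for link reference definitions described above could look roughly like this in Python. This is a sketch under loud assumptions, not MD4C's or cmark's code: the helper names `strip_leading_lrds` and `render_paragraph` are made up, the regex only recognizes single-line definitions with an optional double-quoted title, and no HTML escaping is done.

```python
import re

# Simplified single-line LRD (assumption): [label]: destination "title"
LRD = re.compile(
    r'^ {0,3}\[([^\[\]]+)\]:[ \t]*(\S+)(?:[ \t]+"([^"]*)")?[ \t]*$'
)


def strip_leading_lrds(lines, refs):
    """Consume LRDs from the start of a paragraph; return what is left."""
    lines = list(lines)
    while lines:
        m = LRD.match(lines[0])
        if not m:
            break
        # Normalize the label: collapse whitespace and case-fold.
        label = " ".join(m.group(1).split()).casefold()
        if not label:
            break
        # When a label is defined more than once, the first one wins.
        refs.setdefault(label, (m.group(2), m.group(3)))
        lines.pop(0)
    return lines


def render_paragraph(lines, refs):
    """Emit the paragraph, or nothing if only LRDs were present."""
    rest = strip_leading_lrds(lines, refs)
    if not rest:
        return ""  # never emit <p></p>
    return "<p>" + "\n".join(rest) + "</p>"


refs = {}
print(render_paragraph(['[foo]: /url "title"', "Some text."], refs))
print(refs)
```

Note that only leading lines are tested, so a definition-like line in the middle of a paragraph stays ordinary paragraph text.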

Disclaimers:

  • I am not brave enough to word it in some formal way for inclusion into the specs.
  • I also do not claim this is the only possible solution.
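For readers who prefer code to prose, part of the block precedence order listed above can be sketched as a first-match classifier in Python. This is a heavily reduced illustration, not MD4C's md_analyze_line(): the name `classify` is hypothetical, and container continuation, open fenced/HTML block continuation, "brother" vs. "child" list items, list interruption rules, HTML blocks and lazy continuation are all left out. Only the relative order of the remaining checks is the point.

```python
import re

# Simplified patterns (assumptions), each allowing up to 3 leading spaces.
THEMATIC = re.compile(r"^ {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}$")
SETEXT = re.compile(r"^ {0,3}(?:=+|-+)[ \t]*$")
CONTAINER = re.compile(r"^ {0,3}(?:>|(?:[-+*]|\d{1,9}[.)])(?:[ \t]|$))")
ATX = re.compile(r"^ {0,3}#{1,6}(?:[ \t]|$)")
FENCE = re.compile(r"^ {0,3}(?:`{3,}|~{3,})")
INDENTED = re.compile(r"^(?: {4}|\t)")


def classify(line: str, prev_is_paragraph: bool = False) -> str:
    """Return the first block type that matches, in precedence order."""
    line = line.rstrip("\n")
    checks = [
        ("blank", lambda: line.strip() == ""),
        # A Setext underline needs a paragraph line right before it.
        ("setext underline", lambda: prev_is_paragraph and SETEXT.match(line)),
        # Checked before list markers, so "* * *" is not a list item.
        ("thematic break", lambda: THEMATIC.match(line)),
        # Indented code cannot interrupt a paragraph.
        ("indented code", lambda: not prev_is_paragraph and INDENTED.match(line)),
        ("container", lambda: CONTAINER.match(line)),
        ("atx heading", lambda: ATX.match(line)),
        ("fenced code", lambda: FENCE.match(line)),
    ]
    for name, test in checks:
        if test():
            return name
    return "paragraph"  # anything not recognized above


print(classify("* * *"))                          # thematic break
print(classify("* item"))                         # container
print(classify("---", prev_is_paragraph=True))    # setext underline
```

Swapping the "thematic break" and "container" entries makes `* * *` come out as a list item, which shows why the order of the checks, and not just their presence, has to be specified.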

mity, Mar 26 '19 21:03