RFC: TSDoc-flavored-Markdown (TSFM) instead of CommonMark

Open octogonz opened this issue 6 years ago • 7 comments

Based on the issues encountered in the issue #12 thread, we are concluding that TSDoc cannot reasonably be based directly on the CommonMark spec. The goals are conflicting:

  • CommonMark goal: ("common" = union) Provide a standardized algorithm for parsing every familiar markup notation. It's okay if the resulting syntax rules are impossible for humans to memorize, because mistakes can be easily corrected using the editor's interactive preview. If a syntax is occasionally misinterpreted, the consequence is incorrect formatting on the web site, which is a relatively minor issue.

  • TSFM goal: ("common" = intersection) Provide a familiar syntax that is very easy for humans to memorize, so that a layperson can predict exactly how their markup will be rendered (by every possible downstream doc pipeline). Computer source code is handled by many different viewers which may not support interactive preview. If a syntax is occasionally misinterpreted, the consequence is that a tag such as @beta or @internal may be ignored by the parser, which could potentially cause a serious issue (e.g. an escalation from an enterprise customer whose service was interrupted because of a broken API contract).

Hypothesis: For every TSFM construct, there exists a normalized form that will be parsed identically by CommonMark and TSDoc. In "strict mode" the TSDoc library can issue warnings for expressions that are not in normalized form. Assuming the author eliminates all such warnings, a documentation pipeline can pass unmodified TSDoc content through to a backend CommonMark engine and have confidence that the rendered output will be correct.

Below are some proposed TSFM restrictions:

Whitespace generally doesn't matter

This principle is very easy for people to remember, and eliminates a ton of edge cases.

Example 1:

/**
 * TSFM considers this to be an HTML element, whereas CommonMark does not:
 * <element attribute="@tag"
 *
 * />
 */

Example 1 converted to normalized form (so CommonMark interprets it the same as TSDoc):

/**
 * TSFM considers this to be an HTML element, whereas CommonMark does not:
 * <element attribute="@tag"
 * />
 */

Example 2:

/**
 * CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
 * 
 *     **bold** @tag
 */

Example 2 converted to normalized form (so CommonMark interprets it the same as TSDoc):

/**
 * CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
 * 
 * **bold** @tag
 */
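The "strict mode" warning idea can be illustrated for Example 2. The following is only a sketch, not the TSDoc library's behavior: `findIndentWarnings` and the `IndentWarning` shape are hypothetical names, the input is assumed to be comment lines with the `/**`, ` * ` and `*/` framing already stripped, and only space indentation is considered. It flags a line that follows a blank line (or starts the comment) and is indented by four or more spaces, since that is the case where CommonMark would begin an indented code block while TSFM would still see rich markup.

```typescript
// Hypothetical warning shape; not part of any real TSDoc API.
interface IndentWarning {
  line: number; // 1-based line number within the comment body
  message: string;
}

// Flags lines that are not in normalized form with respect to indentation.
// Assumes the comment framing ("/**", " * ", "*/") was removed beforehand.
function findIndentWarnings(lines: string[]): IndentWarning[] {
  const warnings: IndentWarning[] = [];
  // The start of the comment behaves like the line after a blank line.
  let previousBlank = true;

  lines.forEach((text, idx) => {
    const blank = text.trim() === "";
    if (!blank) {
      // Offset of the first non-whitespace character = indentation width.
      const indent = text.search(/\S/);
      if (previousBlank && indent >= 4) {
        warnings.push({
          line: idx + 1,
          message: "Indentation of 4+ spaces after a blank line: CommonMark would treat this as a code block",
        });
      }
    }
    previousBlank = blank;
  });

  return warnings;
}

// Example 2 as written, then its normalized form.
console.log(findIndentWarnings(["intro text:", "", "    **bold** @tag"]));
console.log(findIndentWarnings(["intro text:", "", "**bold** @tag"]));
```

Reporting only the first line of each indented run keeps the output to one warning per would-be code block.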

Stars cannot be nested arbitrarily

TSDoc will support stars for bold/italics, based on 6 types of tokens that can be recognized by the lexical analyzer with minimal lookahead:

  • Opening italics single-star, e.g. *text is interpreted as <i>text
  • Closing italics single-star, e.g. text* is interpreted as text</i>
  • Opening bold double-star, e.g. **text is interpreted as <b>text
  • Closing bold double-star, e.g. text** is interpreted as text</b>
  • Opening bold+italics triple-star, e.g. ***text is interpreted as if <b+i>text
  • Closing bold+italics triple-star, e.g. text*** is interpreted as if text</b+i>

Other patterns are NOT interpreted as star tokens, e.g. text * text * contains literal asterisks, as does ****a****. A letter in the middle of a word can never be styled using stars, e.g. Toys*R*Us contains literal asterisk characters. A single-star followed by a double-star can be closed by a triple-star (e.g. *italics **bold+italics*** is seen as <i>italics<b>bold+italics</b+i>). Star markup is prohibited from spanning multiple lines.

Other characters (e.g. underscore) are NOT supported by TSDoc as synonyms for bold/italics.
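A rough sketch of how the six star tokens might be recognized follows. It is an illustration under stated assumptions, not the TSDoc lexer: `lexStars` and `StarToken` are hypothetical names, a run of stars counts as "opening" only when it follows whitespace or the start of the line and is followed by a non-whitespace character, and as "closing" only when it follows a non-whitespace character and is followed by whitespace or the end of the line. That boundary rule is a simplification (for instance, it does not treat `text*,` as a closing star), and the flattening of nested or unbalanced tokens described in Example 4 is left out.

```typescript
type StarKind = "italic" | "bold" | "bold+italic";

// Hypothetical token shape for illustration only.
interface StarToken {
  kind: StarKind;
  role: "open" | "close";
  index: number; // offset of the first star of the run within the line
}

// True for whitespace and for "no character" (start or end of line).
function isBoundary(ch: string | undefined): boolean {
  return ch === undefined || /\s/.test(ch);
}

// Scans a single line; star markup never spans lines in this proposal.
function lexStars(line: string): StarToken[] {
  const tokens: StarToken[] = [];
  let i = 0;
  while (i < line.length) {
    if (line[i] === "\\") {
      i += 2; // skip a backslash-escaped character such as \*
      continue;
    }
    if (line[i] !== "*") {
      i++;
      continue;
    }
    // Measure the run of consecutive stars.
    let end = i;
    while (end < line.length && line[end] === "*") end++;
    const runLength = end - i;
    const before = i > 0 ? line[i - 1] : undefined;
    const after = end < line.length ? line[end] : undefined;

    // Runs of four or more stars are literal text.
    if (runLength <= 3) {
      const kind: StarKind =
        runLength === 1 ? "italic" : runLength === 2 ? "bold" : "bold+italic";
      if (isBoundary(before) && !isBoundary(after)) {
        tokens.push({ kind, role: "open", index: i });
      } else if (!isBoundary(before) && isBoundary(after)) {
        tokens.push({ kind, role: "close", index: i });
      }
      // Otherwise (mid-word or surrounded by spaces) the stars are literal.
    }
    i = end;
  }
  return tokens;
}

console.log(lexStars("*italics **bold+italics***"));
console.log(lexStars("Toys*R*Us")); // no tokens: mid-word stars are literal
```

Because a run can be classified by looking only at the characters immediately before and after it, the lookahead stays bounded, which is the property the proposal is after.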

Example 3:

/**
 * *CommonMark sees italics, but TSDoc does not because
 * its stars cannot span lines.*
 *
 * CommonMark sees italics here: __proto__
 *
 * Common**M**ark sees a boldfaced M, but TSDoc sees literal stars.
 */

Example 3 normalized form:

/**
 * \*CommonMark sees italics, but TSDoc does not because
 * its stars cannot span lines.\*
 *
 * CommonMark sees italics here: \_\_proto\_\_ (or better to use `__proto__`)
 *
 * Common\*\*M\*\*ark sees a boldfaced M, but TSDoc sees literal stars.
 *
 * If you really need to boldface a letter, use HTML elements: Common<b>M</b>ark.
 */

Example 4:

/**
 * For **A **B** C** the B is double-boldfaced according to CommonMark.
 * The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
 * to `<b>A **B</b> C**` because it doesn't allow nesting.
 *
 * Improper balancing also gets ignored, e.g. for **A *B** C* the TSDoc tokenizer
 * will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
 * Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
 */

Example 4 normalized form:

/**
 * For **A \*\*B** C\*\* the B is double-boldfaced according to CommonMark.
 * The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
 * to `<b>A **B</b> C**` because it doesn't allow nesting.
 *
 * Improper balancing also gets ignored, e.g. for **A \*B** C\* the TSDoc tokenizer
 * will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
 * Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
 */

Code spans are simplified

For TSFM, a nonescaped backtick starts a code span that ends at the next backtick; a backtick with no closing partner is not code. Whitespace (including blank lines) doesn't matter.

Example 5:

/**
 * `Both TSDoc and CommonMark
 * agree this is code.`
 *
 * before `CommonMark disagrees
 *
 * if a line is skipped, though.` after
 *
 * `But this is not code because the backtick is unterminated
 */

Example 5 normalized form:

/**
 * `Both TSDoc and CommonMark
 * agree this is code.`
 *
 * before `CommonMark disagrees
 * if a line is skipped, though.` after
 *
 * \`But this is not code because the backtick is unterminated
 */
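The simplified code span rule is small enough to sketch directly. This is an illustration only: `findCodeSpans` is a hypothetical helper, the input is assumed to be the comment text with its `/**` and ` * ` framing already removed, and "the next backtick" is taken literally, so a backslash inside a span is not treated as an escape (the proposal does not say either way).

```typescript
// Returns the contents of each code span found in the text.
// Line breaks and blank lines inside a span are allowed, per the TSFM rule.
function findCodeSpans(text: string): string[] {
  const spans: string[] = [];
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === "\\") {
      i += 2; // skip an escaped character such as \`
      continue;
    }
    if (ch === "`") {
      const close = text.indexOf("`", i + 1);
      if (close === -1) {
        // Unterminated backtick: it is literal text, and no later span can exist.
        break;
      }
      spans.push(text.slice(i + 1, close));
      i = close + 1;
      continue;
    }
    i++;
  }
  return spans;
}

console.log(findCodeSpans("before `CommonMark disagrees\n\nif a line is skipped, though.` after"));
console.log(findCodeSpans("`But this is not code because the backtick is unterminated"));
```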

Blocks don't nest

I want to say that ">" blockquotes should not be supported at all, since the whitespace handling for these constructs is highly counterintuitive. Instead we would recommend <blockquote> HTML tags for this scenario.

Lists are a very useful and common scenario. However, CommonMark lists also have a lot of counterintuitive rules regarding handling of whitespace.

A simplification would be to say that TSFM interprets any line that starts with "-" as being a list item, and the list ends with the first blank line. No other character (e.g. "*" or "+") can be used to create lists. If complicated nesting is required, then HTML tags such as <ul> and <li> should be used to avoid any confusion.

Example 6:

/**
 * A list with 3 things
 * - item 1
 *              - item 2
 * spans several
 *      lines
 * - item 3
 *
 * Two lists separated by a newline
 * -  list 1 with one item
 *
 * - list 2 with one item
 *
 * + not a list item
 * + not a list item
 *
 * CommonMark surprisingly considers this to be a list whose first item is another list,
 * whereas TSDoc sees a minus character as the first item:
 * - - foo
 */

Example 6 normalized form:

/**
 * A list with 3 things
 * - item 1
 * - item 2
 *   spans several
 *   lines
 * - item 3
 *
 * Two lists separated by a newline
 * -  list 1 with one item
 * <!-- CommonMark requires an HTML comment to separate two lists -->
 * - list 2 with one item
 *
 * \+ not a list item
 * \+ not a list item
 * 
 * CommonMark surprisingly considers this to be a list whose first item is another list,
 * whereas TSDoc sees a minus character as the first item:
 * - \- foo
 */
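The proposed list rule can be sketched in a few lines. This is an illustration under assumptions, not TSDoc's parser: `parseLists` is a hypothetical helper, the input is the comment body with its framing stripped, leading whitespace is ignored (following the "whitespace generally doesn't matter" principle), and a non-blank line that does not start with "-" is treated as a continuation of the current item.

```typescript
// Returns one array of item texts per list found in the input lines.
function parseLists(lines: string[]): string[][] {
  const lists: string[][] = [];
  let current: string[] | undefined;

  for (const raw of lines) {
    const line = raw.trim();
    if (line === "") {
      current = undefined; // a blank line always ends the list
      continue;
    }
    if (line.startsWith("-")) {
      if (!current) {
        current = [];
        lists.push(current);
      }
      // Everything after the first "-" is item text, so "- - foo" yields "- foo".
      current.push(line.slice(1).trim());
    } else if (current) {
      // Continuation line: append to the most recent item.
      current[current.length - 1] += " " + line;
    }
    // Lines outside a list (including "+ ..." and "* ...") are ordinary text.
  }
  return lists;
}

const example6 = [
  "A list with 3 things",
  "- item 1",
  "             - item 2",
  "spans several",
  "     lines",
  "- item 3",
  "",
  "+ not a list item",
  "",
  "- - foo",
];
console.log(JSON.stringify(parseLists(example6)));
```

Since only one marker character is recognized and nesting is not attempted, the parser never has to compare indentation widths, which is where most of CommonMark's list edge cases come from.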

octogonz avatar Jun 28 '18 00:06 octogonz

I started prototyping this idea today. If we go this route, it will be a major simplification and should save us a lot of time on the implementation.

octogonz avatar Jun 28 '18 04:06 octogonz

Something worth calling out here is how this can interact with docs.microsoft.com/DocFX. Now, I know that we are working on a standard here, but fragmentation and a bunch of custom stuff is a bit of a concern. We do have support for Markdown Extensions, so likely that should be a place where we can plug in.

The format you are talking about here is parser-specific. On docs.microsoft.com, we've recently switched to MarkDig, which handles CommonMark parsing much better. It would be preferable not to invent our own standard, because the rest of the documentation stack does not use it (and we have no plans to), and guiding people to one set of conventions for TS documentation contributions and another for the rest of docs seems problematic. Besides, this also raises the issue of our own parser interpreting the proposed conventions incorrectly.

dend avatar Jul 09 '18 17:07 dend

> The format you are talking about here is parser-specific. On docs.microsoft.com, we've recently switched to MarkDig, which handles CommonMark parsing much better. It would be preferable not to invent our own standard, because the rest of the documentation stack does not use it (and we have no plans to), and guiding people to one set of conventions for TS documentation contributions and another for the rest of docs seems problematic. Besides, this also raises the issue of our own parser interpreting the proposed conventions incorrectly.

MarkDig is only a .NET implementation, right? Won't that be problematic for JavaScript/TypeScript authors (i.e. they'd need a way to run .NET on their machine or CI server)?

dschnare avatar Dec 01 '18 13:12 dschnare

@pgonzal The current examples show an input that would be treated differently by the two implementations, but only show one normalized form. Each example should be presented with two normalized forms:

  1. The normalized form that causes both parsers to interpret the input in the manner that TSFM interprets the original input
  2. The normalized form that causes both parsers to interpret the input in the manner that CommonMark interprets the original input (this is the one that's missing)

sharwell avatar Dec 03 '18 14:12 sharwell

While I agree with encouraging users to use a normalized input form where available, I generally disagree with the premise of this proposal (that CommonMark is problematic and/or suggesting that another Markdown flavor will serve to simplify the space). The following are specific claims which I most disagree with:

> I want to say that ">" blockquotes should not be supported at all, since the whitespace handling for these constructs is highly counterintuitive.

This form is widely used (it's even the default behavior when you click the quote formatting button in the GitHub editor), and everyone since the dawn of email knows that > at the beginning of a line means a quote.

> However, CommonMark lists also have a lot of counterintuitive rules regarding handling of whitespace.

These rules may cause some confusion for new users, but the general availability of live-preview editors helps users avoid the pitfalls.

> No other character (e.g. "*" or "+") can be used to create lists.

This is confusing. Many people (myself included) only use one of these for lists.

> … then HTML tags ... should be used …

For any case directly supported by CommonMark without the use of HTML tags, I would oppose a restriction that does not allow a normalized form of the same content to exist without using HTML tags. Markdown provides a set of features which generally allow users to avoid falling back to HTML tags, and a Markdown processor which deviates from this goal feels incomplete. As a user, it would be frustrating to be told Markdown can be used, only to find that it can only sometimes be used.

sharwell avatar Dec 03 '18 14:12 sharwell

@sharwell thanks for your feedback here. As I mentioned in the initial issue description, TSDoc and Markdown have somewhat different requirements, which greatly complicates attempts to embed Markdown inside TSDoc. The two biggest hangups for me are:

  1. Predictability: TSDoc cannot assume that a person authoring a code comment will have an easy way to interactively preview how it gets rendered. For example, when they type a * character, it must be completely obvious whether this symbol means boldface, italics, list item, or a literal * character. (By contrast, every Markdown implementation has lots of "gotcha" behaviors that trip people up, but they just fiddle with the interactive preview until it comes out right.)

  2. Interoperability: The premise of TSDoc is that different tools must be able to process the same input and agree on its interpretation. In particular, they must exactly agree about questions such as "Does this @ character start a TSDoc tag or not?" (By contrast, CommonMark is more about improving consistency when people move between different projects that use different Markdown engines. In my experience it's pretty rare for the same input file to be production-rendered by two different Markdown engines. Even with CommonMark there are going to be incompatibilities if you attempt that. It's actually quite common for a single Markdown engine to evolve its own syntax in a way that breaks some existing content.)

Keep in mind that TSDoc is not just some English prose for humans to read. TSDoc goes in computer source code, and sometimes it contains tags that affect how a project gets built. Often it gets edited in a Git "merge conflict" editor that doesn't have any nice syntax highlighting.

> I want to say that ">" blockquotes should not be supported at all, since the whitespace handling for these constructs is highly counterintuitive.

> This form is widely used (it's even the default behavior when you click the quote formatting button in the GitHub editor), and everyone since the dawn of email knows that > at the beginning of a line means a quote.

That seems reasonable. But could you share an example of a realistic TypeScript code comment where someone would need to use >? For our own projects, the content that we write inside code comments (i.e. API reference) seems to be much simpler than what goes in a regular Markdown file (i.e. feature articles and tutorials). So for us at least, there is relatively less demand for advanced text formatting in TSDoc.

> As a user, it would be frustrating to be told Markdown can be used, only to find that it can only sometimes be used.

I agree it's frustrating. But maybe it would be less frustrating than a realization that the syntax is not predictable ("I have no idea whether the stuff I'm writing will get rendered correctly by whatever documentation tool runs in this particular repo") or not interoperable ("I marked this API as @beta -- but we accidentally shipped it to production because I didn't understand some esoteric Markdown grammar rule.").

That said, I'm not being dogmatic about this. We modeled TSDoc as an open standard specifically to solicit your input and ideas. :-) A lot of these debates seem to get settled when we switch from philosophy and design, and instead look at specific real-world documentation problems that turn up. For example https://github.com/Microsoft/tsdoc/issues/128 was fairly enlightening for me personally.

By end of 2018, our API Extractor tool will have fully implemented all the core features of TSDoc (including declaration references) and processed a fairly large corpus of Microsoft APIs. When we write up the spec proposal, I want it to include real-world examples for each design decision.

octogonz avatar Dec 06 '18 20:12 octogonz

> That seems reasonable. But could you share an example of a realistic TypeScript code comment where someone would need to use >?

@pgonzal The only case where it's come up for me to date is here: https://github.com/tunnelvisionlabs/antlr4ts/pull/393/commits/334007dcc6d9a04d61d2f14084533145d2f96fba

Historically, the other thing I've used the quote syntax for is arguably improper, e.g. callouts like this:

> ⚠️ This method likely does not behave as you expect.

It's the best translation I could think of at the time for what I would prefer to write with C#'s <note> element:

/// <note type="warning">
/// <para>This method likely does not behave as you expect.</para>
/// </note>

sharwell avatar Dec 06 '18 21:12 sharwell