tsdoc
RFC: Two possibilities for integrating Markdown into TSDoc
Problem Statement
There are numerous incompatible Markdown flavors. For this discussion, let's assume "Markdown" means strict CommonMark unless otherwise specified.
Many people expect to use Markdown notations inside their JSDoc. Writing a Markdown parser is already somewhat tricky, since the grammar is highly sensitive to context (compared to other rich-text formats such as HTML). Extending it with JSDoc tags causes some interesting collisions and ambiguities. Some motivating examples:
1. Code fences that span tags
/**
* I can use backticks to create a `code fence` that gets highlighted in a typewriter font.
*
* This `@tag` is not a TSDoc tag, since it's inside a code fence.
*
* This {@link MyClass | hyperlink to the `MyClass` base class} should get highlighting
* in its target text.
*
* This {@link MyClass | example of backtick (`)} has an unbalanced backtick (`)
* inside a tag.
*/
Intuitively we'd expect it to be rendered like this:
I can use backticks to create a code fence that gets highlighted in a typewriter font.
This @tag is not a TSDoc tag, since it's inside a code fence.
This hyperlink to the MyClass base class should get highlighting in its target text.
This example of backtick (`) has an unbalanced backtick (`) inside a tag.
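To make the last case concrete, here is a minimal TypeScript sketch of how a Markdown-first scanner might pair backticks. It is an illustration under stated assumptions, not TSDoc's implementation: `findCodeSpanRanges` and `CodeSpanRange` are hypothetical names, and the pairing rule is a simplified reading of CommonMark code spans (an opening run of backticks closes at the next run of the same length, and an unmatched run stays literal).

```typescript
// Hypothetical helper types and names, for illustration only.
interface CodeSpanRange {
  start: number; // index of the first opening backtick
  end: number;   // index just past the last closing backtick
}

// Simplified CommonMark-style pairing: a run of N backticks opens a code span
// that closes at the next run of exactly N backticks. Unmatched runs are literal.
function findCodeSpanRanges(text: string): CodeSpanRange[] {
  const ranges: CodeSpanRange[] = [];
  let i = 0;
  while (i < text.length) {
    if (text.charAt(i) !== "`") {
      i++;
      continue;
    }
    // Measure the opening run of backticks.
    let runEnd = i;
    while (runEnd < text.length && text.charAt(runEnd) === "`") {
      runEnd++;
    }
    const runLength = runEnd - i;
    // Look for a later run of the same length.
    let j = runEnd;
    let closeEnd = -1;
    while (j < text.length) {
      if (text.charAt(j) !== "`") {
        j++;
        continue;
      }
      let k = j;
      while (k < text.length && text.charAt(k) === "`") {
        k++;
      }
      if (k - j === runLength) {
        closeEnd = k;
        break;
      }
      j = k;
    }
    if (closeEnd >= 0) {
      ranges.push({ start: i, end: closeEnd });
      i = closeEnd;
    } else {
      i = runEnd; // no closer: the backticks are plain text
    }
  }
  return ranges;
}
```

Run on the last sentence of the example, this scanner pairs the two "unbalanced" backticks with each other, so the resulting span swallows the `}` that was meant to close the `{@link}` tag. Whichever grammar gets to scan first decides how the comment is read.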
2. Stars
Stars have the same problems as backticks, but with even more special cases:
/**
* Markdown would treat these as
* * bullet
* * items.
*
* Inside code comments, the left margin is sometimes ambiguous:
** bullet
** items?
*
* Markdown confusingly *allows a * inside an emphasis*.
* Does a *{@link MyClass | * tag}* participate in this?
*/
Intuitively we'd expect it to be rendered like this:
Markdown would treat these as
- bullet
- items.
Inside code comments, the left margin is sometimes ambiguous:
- bullet
- items?
Markdown confusingly allows a * inside an emphasis. Does a * tag participate in this?
3. Whitespace
Markdown assigns special meanings to whitespace indentation. For example, indenting 4 spaces is equivalent to a ``` block. Newlines also have lots of special meanings.
This could be fairly confusing inside a code comment, particularly with weird cases like this:
/** Is this indented? */
/** some junk
Is this indented? */
/**
Is this okay at all? */
/**
Is this star part of the comment?
* mystery
Or is it a Markdown bullet?
*/
Perhaps TSDoc should issue warnings about malformed comment framing.
Perhaps we should try to disable some of Markdown's indentation rules. For example, the TSDoc parser could trim whitespace from the start of each line.
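A rough sketch of both suggestions (warn about malformed framing, and trim whitespace from the start of each line) might look like the following. The names `extractCommentLines` and `ExtractedComment` are made up for this example, and the warning wording is arbitrary.

```typescript
// Hypothetical result shape, for illustration only.
interface ExtractedComment {
  lines: string[];    // comment text with framing and leading whitespace removed
  warnings: string[]; // notes about malformed comment framing
}

function extractCommentLines(comment: string): ExtractedComment {
  const warnings: string[] = [];
  let body = comment;
  if (body.slice(0, 3) === "/**") {
    body = body.slice(3);
  } else {
    warnings.push("comment does not start with /**");
  }
  if (body.slice(-2) === "*/") {
    body = body.slice(0, -2);
  } else {
    warnings.push("comment does not end with */");
  }

  const rawLines = body.split(/\r?\n/);
  const lines: string[] = [];
  for (let i = 0; i < rawLines.length; i++) {
    const trimmed = rawLines[i].replace(/^\s+/, "").replace(/\s+$/, "");
    if (i === 0) {
      // Text that shares a line with the opening "/**".
      lines.push(trimmed);
    } else if (trimmed.charAt(0) === "*") {
      // Drop the margin star, then any indentation that follows it.
      lines.push(trimmed.slice(1).replace(/^\s+/, ""));
    } else {
      if (trimmed.length > 0) {
        warnings.push("line " + (i + 1) + " does not start with a star");
      }
      lines.push(trimmed);
    }
  }
  return { lines, warnings };
}
```

Trimming every line sidesteps the "is this indented?" question, at the cost of giving up Markdown features that depend on indentation, such as indented code blocks and nested lists.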
4. Markdown Links
Markdown supports these constructs:
[Regular Link](http://example.com)
[Cross-reference Link][1]
. . .
[1]: http://b.org

Autolinks are handy: http://example.com
However if you want an accurate URL-detector, it turns out to be a fairly big library dependency.
The Markdown link functionality partially overlaps with JSDoc's {@link} tag. But it's missing support for API item references.
5. Markdown Tables
Markdown tables have a ton of limitations. Many constructs aren't supported inside table cells. You can't even put a newline inside a table cell. CommonMark had a long discussion about this, but so far does not support the pipes-and-dashes table syntax at all. Instead it uses HTML tables. This seems pretty wise.
6. HTML elements
Most Markdown flavors allow HTML mixed into your content. The CommonMark spec has an entire section about this. This is convenient, although HTML is an entire separate grammar with its own complexities. For example, HTML has a completely distinct escaping mechanism from Markdown.
Here's a few interesting cases to show some interactions:
/**
* Here's a <!-- @remarks --> tag inside an HTML comment.
*
* Here's a TSDoc tag that {@link MyClass | <!-- } seemingly starts --> an HTML comment.
*
* The `@remarks` tag normally separates two major TSDoc blocks. Is it okay for that
* to appear inside a table?
*
* <table><tr><td>
* @remarks
* </td></tr></table>
*/
Two Possible Solutions
Option 1: Extend an existing CommonMark library
The most natural approach would be for the TSDoc parser to include an integrated CommonMark parser. The two grammars would be mixed together. We definitely don't want to write a CommonMark parser from scratch, so instead the TSDoc library would need to extend an existing library. Markdown-it and Flavormark are possible choices that are both oriented towards custom extensions.
Possible downsides:
- Incorporating full Markdown into the TSDoc AST nodes implies that our doc comment emitter would need to be a full Markdown emitter. (In my experience, correctly emitting Markdown is every bit as tricky as parsing Markdown.)
- To support an entrenched backend with its own opinionated Markdown flavor, this approach wouldn't pass Markdown content through from doc comments; instead the backend would have to parse AST nodes that were emitted back to Markdown. This can be good (if you're rigorous and writing a proper translator) or bad (if you're taking the naive route).
- This approach couples our API contract (e.g. the AST structure) to an external project
- Possibly increases friction for tools that are contemplating taking a dependency on @microsoft/tsdoc
Option 2: Treat full Markdown as a postprocess
A possible shortcut would be to say that TSDoc operates as a first pass that snips out the structures we care about, and returns everything else as plain text. We don't want to get tripped up by backticks, so we make a small list of core constructs that can easily screw up parsing:
- code fences (backticks)
- links
- CommonMark escapes
- HTML elements (but only as tokens, ignoring nesting)
- HTML comments (?)
Anything else is treated as plain text for TSDoc, and gets passed through (to be possibly reinterpreted by another layer of the documentation pipeline).
/**
* This is *bold*. Here's a {@link MyClass | link to `MyClass`}. <div>
* @remarks
* Here's some more stuff. </bad>
*/
Here's some pseudocode for a corresponding AST:
[
  {
    "nodeKind": "textNode",
    "content": "This is *bold*. Here's a " // <-- we ignore the Markdown stars
  },
  {
    "nodeKind": "linkNode",
    "apiItemReference": {
      "itemPath": "MyClass"
    },
    "linkText": [
      {
        "nodeKind": "textNode",
        "content": "link to "
      },
      {
        "nodeKind": "codeFenceNode", // <-- we parse the backticks though
        "children": [
          {
            "nodeKind": "textNode",
            "content": "MyClass"
          }
        ]
      }
    ]
  },
  {
    "nodeKind": "textNode", // <-- text after the closing "}" is a sibling of the link
    "content": ". "
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "div"
  },
  {
    "nodeKind": "customTagNode",
    "tag": "@remarks"
  },
  {
    "nodeKind": "textNode",
    "content": "Here's some more stuff."
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "bad", // <-- we care about HTML delimiters, but not HTML structure
    "isEndTag": true
  }
]
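For readers who prefer types to JSON, the pseudocode above could be modeled roughly as a TypeScript discriminated union. The field names are taken from the pseudocode; the `DocNode` type and the `renderDocNodes` function are assumptions added here to show how a backend might turn the nodes back into text for a downstream Markdown renderer.

```typescript
// Node shapes mirror the pseudocode AST; the type name is hypothetical.
type DocNode =
  | { nodeKind: "textNode"; content: string }
  | { nodeKind: "codeFenceNode"; children: DocNode[] }
  | { nodeKind: "linkNode"; apiItemReference: { itemPath: string }; linkText: DocNode[] }
  | { nodeKind: "htmlElementNode"; elementName: string; isEndTag?: boolean }
  | { nodeKind: "customTagNode"; tag: string };

// Emit one node back to comment-style text. Plain text (including any
// Markdown stars) passes through untouched for a later Markdown stage.
function renderDocNode(node: DocNode): string {
  switch (node.nodeKind) {
    case "textNode":
      return node.content;
    case "codeFenceNode":
      return "`" + renderDocNodes(node.children) + "`";
    case "linkNode":
      return "{@link " + node.apiItemReference.itemPath + " | " + renderDocNodes(node.linkText) + "}";
    case "htmlElementNode":
      return "<" + (node.isEndTag ? "/" : "") + node.elementName + ">";
    case "customTagNode":
      return node.tag;
    default: {
      // Compile-time check that every node kind is handled.
      const unreachable: never = node;
      return unreachable;
    }
  }
}

function renderDocNodes(nodes: DocNode[]): string {
  let output = "";
  for (let i = 0; i < nodes.length; i++) {
    output += renderDocNode(nodes[i]);
  }
  return output;
}
```

Note that the pseudocode AST does not record line breaks, so a real emitter would need extra nodes or fields to reproduce the original comment faithfully.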
Possible downsides:
- The resulting syntax would be fairly counterintuitive for people who assume they're writing real Markdown. All the weird little Markdown edge cases would be handled oddly.
- This model invites a documentation pipeline to do nontrivial syntactic postprocessing. For content authors, the language wouldn't have a unified specification. (This isn't like a templating library that supports proprietary HTML tags. Instead, it's more like if one tool defined HTML without attributes, and then another tried to retrofit attributes on top of it.)
- We might end up having to code a small CommonMark parser (although it would be a subset of the work involved for a parser that handles the full grammar)
- How will the second stage Markdown parser accurately report line numbers for errors?
What do you think? Originally I was leaning towards #1 above, but now I'm wondering if #2 might be a better option.
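On the line-number question in the last downside: one possible answer is for the first pass to remember where each extracted line came from, so that a second-stage parser can translate its positions back to the source file. The sketch below assumes that approach; `SourceMappedLine`, `mapCommentLines`, and `toSourcePosition` are hypothetical names, and the framing rule (optional whitespace, one star, one optional space) is a simplification.

```typescript
// Hypothetical bookkeeping record, for illustration only.
interface SourceMappedLine {
  text: string;         // line content with the comment framing removed
  sourceLine: number;   // 1-based line number in the original file
  sourceColumn: number; // 1-based column where `text` begins in that line
}

// `firstLine` is the 1-based source line on which the "/**" appears.
function mapCommentLines(comment: string, firstLine: number): SourceMappedLine[] {
  const result: SourceMappedLine[] = [];
  const rawLines = comment.split(/\r?\n/);
  for (let i = 0; i < rawLines.length; i++) {
    const raw = rawLines[i];
    // Framing prefix: indentation, then "/**" or a margin star, then one space.
    const prefix = /^\s*(?:\/\*\*|\*(?!\/))? ?/.exec(raw);
    const prefixLength = prefix ? prefix[0].length : 0;
    // Strip a trailing "*/" so it never reaches the second stage.
    const text = raw.slice(prefixLength).replace(/\s*\*\/\s*$/, "");
    result.push({ text: text, sourceLine: firstLine + i, sourceColumn: prefixLength + 1 });
  }
  return result;
}

// Translate a position reported against the extracted text (0-based line
// index and 0-based column) into a 1-based position in the source file.
function toSourcePosition(
  lines: SourceMappedLine[],
  lineIndex: number,
  column: number
): { line: number; column: number } {
  const entry = lines[lineIndex];
  return { line: entry.sourceLine, column: entry.sourceColumn + column };
}
```

This only works if the second stage reports positions per extracted line. If it receives one concatenated string instead, an offset table per line would be needed as well.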
TypeDoc takes the second approach. There are some special case situations where parsing needs to be markdown aware (e.g. code blocks) but most of the parsing can be passed to a true markdown parser.
When testing this out I noticed some bugs with how TypeDoc handles links and markdown. You can see how TypeDoc renders some of the examples above as well as how it handles the following links.
/**
* TypeDoc handles a square bracket syntax to link to [[MyClass]]
* with [[MyClass|pipe labeled links]] and [[MyClass space labeled links]]
*
* TypeDoc handles basic links to {@link MyClass}
* but {@link MyClass | labeled links} are broken.
*
* Code fences can expose parsed links `{@link MyClass}`
*
* ```
* As are code blocks with {@link MyClass} text
* ```
*/
export class MyClass {}

For #2 and #3, TypeDoc is very forgiving. Generally it removes the first asterisk of a line. Whitespace is usually retained. If there is an empty (except for the asterisk) line between lines, a break is generated.


HTML is also supported, causing some interesting results.

Currently, JSDoc tags split into two categories: block tags and inline tags. However, there are only two inline tags ({@link} and {@tutorial}), and both represent links.
I want to know whether any other inline tags are in use today, and how likely it is that TSDoc will add more inline tags in the future. If both answers are no, can we just throw the whole "inline tags" concept away, and extend Markdown links to replace the two existing tags?
By doing this, for question 1, we no longer need to worry about collisions between inline JSDoc tags and inline Markdown syntax; for question 4, we get a single, powerful link syntax instead of two, and we can just rely on the Markdown parser to parse comments.
Then the only remaining concept is block tags. As the name suggests, I expect each one to be the first non-whitespace token on its line, starting a block. Anything between two block tags (or between a tag and the end of the comment) belongs to the preceding tag. This also answers #13.
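The block-tag rule described here is simple enough to sketch. The following TypeScript is a hypothetical illustration (`TagBlock` and `splitIntoTagBlocks` are invented names) that assumes the comment framing has already been removed and that a tag name is `@` followed by letters and digits. It deliberately ignores code fences and escapes.

```typescript
// Hypothetical block record, for illustration only.
interface TagBlock {
  tag: string;       // "" for the text that precedes the first block tag
  content: string[]; // lines belonging to this block
}

function splitIntoTagBlocks(lines: string[]): TagBlock[] {
  // A block tag must be the first non-whitespace token on its line.
  const blockTagPattern = /^\s*(@[A-Za-z][A-Za-z0-9]*)\b\s*(.*)$/;
  const blocks: TagBlock[] = [{ tag: "", content: [] }];
  for (let i = 0; i < lines.length; i++) {
    const match = blockTagPattern.exec(lines[i]);
    if (match) {
      // Start a new block; any text after the tag is its first content line.
      const rest = match[2];
      blocks.push({ tag: match[1], content: rest.length > 0 ? [rest] : [] });
    } else {
      // Everything else belongs to the most recent block.
      blocks[blocks.length - 1].content.push(lines[i]);
    }
  }
  return blocks;
}
```

Under this rule an `@word` in the middle of a line never starts a block, which removes one source of ambiguity but also means the rule says nothing about tags inside code samples.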
For questions 2 and 3, I would personally enforce well-formed comments (from the second line onward, every line starts with exactly one star and exactly one space) rather than accept arbitrarily framed lines that bring no benefit.
@yume-chan It would be great to simplify the comment parsing. However, I think it may be surprising to JavaScript developers familiar with JSDoc if {@link} tags are not supported. I would expect TSDoc to support the text of existing doc comments as projects convert over to TypeScript. Additionally, Markdown links don't seem well suited for the API links discussed in #9.
Getting back in the saddle, today we merged the PR that sets up the initial tsdoc parser library project. I'm starting with Option 2 and we'll see how that goes. I've been experimenting with different approaches for the tokenizer strategy and will follow up.
I want to know whether any other inline tags are in use today, and how likely it is that TSDoc will add more inline tags in the future. If both answers are no, can we just throw the whole "inline tags" concept away, and extend Markdown links to replace the two existing tags?
The @include tag is another potential inline tag. For custom tags, I believe the {@tagname parameters...} form is the only JSDoc-flavored way to allow custom tags with arbitrary parameters. Block tags (e.g. @tagname) can't take parameters because we don't know where their content ends.
Markdown links don't provide a generalized pattern for parameterized tags. Also their link target is somewhat vague ("zero or more characters") which might make it difficult to detect the rather elaborate reference syntax that @MartynasZilinskas was working on.
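To illustrate why the curly-brace form generalizes, here is a small sketch that extracts any `{@tagname parameters...}` occurrence and leaves the interpretation of the parameters to each tag. The names `InlineTag`, `findInlineTags`, and `splitLinkParameters` are hypothetical, the pipe convention is the one used by the `{@link}` examples earlier in this thread, and nested braces, escapes, and backticks are ignored for brevity.

```typescript
// Hypothetical record for one inline tag occurrence.
interface InlineTag {
  tagName: string;    // e.g. "@link"
  parameters: string; // everything between the tag name and the closing brace
  start: number;      // index of "{"
  end: number;        // index just past "}"
}

function findInlineTags(text: string): InlineTag[] {
  const pattern = /\{(@[A-Za-z][A-Za-z0-9]*)\s*([^{}]*)\}/g;
  const tags: InlineTag[] = [];
  let match: RegExpExecArray | null = pattern.exec(text);
  while (match !== null) {
    tags.push({
      tagName: match[1],
      parameters: match[2].replace(/\s+$/, ""),
      start: match.index,
      end: match.index + match[0].length
    });
    match = pattern.exec(text);
  }
  return tags;
}

// Tag-specific interpretation for {@link target | text}.
function splitLinkParameters(parameters: string): { target: string; linkText: string } {
  const pipe = parameters.indexOf("|");
  if (pipe < 0) {
    return { target: parameters.replace(/^\s+|\s+$/g, ""), linkText: "" };
  }
  return {
    target: parameters.slice(0, pipe).replace(/^\s+|\s+$/g, ""),
    linkText: parameters.slice(pipe + 1).replace(/^\s+|\s+$/g, "")
  };
}
```

The generic scanner does not need to know what `@link` or `@include` mean; it only needs the braces to know where the parameters end, which is exactly what block tags lack.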
So, the big architecture of the parser would be like this:
- Extract comment lines from their "/** */" blocks (see this PR)
- Tokenize the content of the lines (e.g. "<" symbol, chunk-of-text, etc.)
- Parse the lines into pre-TSDoc AST nodes (e.g. CommonMark code fences, HTML elements, HTML comments, etc.)
- For the non-escaped text, parse TSDoc block tags and inline tags
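As an illustration of the tokenizing step in this pipeline, a tokenizer along these lines would split each extracted line into single-character symbol tokens and chunks of ordinary text. This is only a sketch under assumptions: the token kinds, `PreToken`, `symbolKind`, and `tokenizeCommentLines` are invented here, and the real tokenizer strategy was still being worked out at this point in the discussion.

```typescript
// Hypothetical token kinds, chosen to cover the characters that matter
// to the later parsing stages.
type PreTokenKind =
  | "text" | "backslash" | "backtick" | "lessThan" | "greaterThan"
  | "atSign" | "leftBrace" | "rightBrace" | "newline";

interface PreToken {
  kind: PreTokenKind;
  text: string;
  line: number;   // 0-based index into the extracted lines
  column: number; // 0-based offset within that line
}

function symbolKind(ch: string): PreTokenKind | undefined {
  switch (ch) {
    case "\\": return "backslash";
    case "`": return "backtick";
    case "<": return "lessThan";
    case ">": return "greaterThan";
    case "@": return "atSign";
    case "{": return "leftBrace";
    case "}": return "rightBrace";
    default: return undefined;
  }
}

function tokenizeCommentLines(lines: string[]): PreToken[] {
  const tokens: PreToken[] = [];
  for (let line = 0; line < lines.length; line++) {
    const content = lines[line];
    let chunkStart = 0;
    for (let column = 0; column < content.length; column++) {
      const kind = symbolKind(content.charAt(column));
      if (kind === undefined) {
        continue; // ordinary character: keep growing the current text chunk
      }
      if (column > chunkStart) {
        tokens.push({ kind: "text", text: content.slice(chunkStart, column), line: line, column: chunkStart });
      }
      tokens.push({ kind: kind, text: content.charAt(column), line: line, column: column });
      chunkStart = column + 1;
    }
    if (content.length > chunkStart) {
      tokens.push({ kind: "text", text: content.slice(chunkStart), line: line, column: chunkStart });
    }
    tokens.push({ kind: "newline", text: "\n", line: line, column: content.length });
  }
  return tokens;
}
```

Because every token records its line and column, later stages can report errors against the original positions without re-scanning the text.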
By pre-TSDoc I mean a conservative subset of the minimal CommonMark-compatible constructs that the TSDoc stage needs to understand in order to avoid accidentally parsing something like these examples:
/**
* Not a TSDoc tag: \@tag
* Not a TSDoc tag: `@tag`
* Not a TSDoc tag: <!-- @tag -->
* Not a TSDoc tag:
* ```
* @tag
* ```
* Not a TSDoc tag: <div data-text="@tag" />
*/
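The "not a TSDoc tag" cases above can be sketched as a single scan that skips the protected constructs and reports only the `@tag` tokens left over. This is a simplified illustration, not the TSDoc parser: `findTsdocTags` is a hypothetical name, input lines are assumed to have their comment framing already removed, code spans are assumed to be single-backtick and single-line, and an HTML element is assumed to end at the first `>`.

```typescript
// Returns the @tags that are NOT hidden by an escape, code span, fenced
// code block, HTML comment, or HTML element (including its attributes).
function findTsdocTags(lines: string[]): string[] {
  const tags: string[] = [];
  let inFence = false;       // inside a ``` fenced code block
  let inHtmlComment = false; // inside <!-- ... -->, possibly across lines

  for (let n = 0; n < lines.length; n++) {
    const line = lines[n];
    if (!inHtmlComment && /^\s*```/.test(line)) {
      inFence = !inFence; // opening or closing fence line
      continue;
    }
    if (inFence) {
      continue;
    }
    let i = 0;
    while (i < line.length) {
      if (inHtmlComment) {
        const commentEnd = line.indexOf("-->", i);
        if (commentEnd < 0) {
          i = line.length;
        } else {
          inHtmlComment = false;
          i = commentEnd + 3;
        }
        continue;
      }
      const ch = line.charAt(i);
      if (ch === "\\") {
        i += 2; // backslash escape: skip the escaped character
        continue;
      }
      if (ch === "`") {
        const spanEnd = line.indexOf("`", i + 1);
        i = spanEnd < 0 ? i + 1 : spanEnd + 1; // skip a closed code span
        continue;
      }
      if (line.slice(i, i + 4) === "<!--") {
        inHtmlComment = true;
        i += 4;
        continue;
      }
      if (ch === "<" && /[A-Za-z\/]/.test(line.charAt(i + 1))) {
        const tagEnd = line.indexOf(">", i + 1);
        i = tagEnd < 0 ? line.length : tagEnd + 1; // skip element and attributes
        continue;
      }
      if (ch === "@") {
        const match = /^@[A-Za-z][A-Za-z0-9]*/.exec(line.slice(i));
        if (match) {
          tags.push(match[0]);
          i += match[0].length;
          continue;
        }
      }
      i++;
    }
  }
  return tags;
}
```

Only the element delimiters are skipped, so text between an opening and closing HTML tag is still scanned for TSDoc tags.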
So the basic pre-TSDoc constructs would be:
- CommonMark escaped characters (backslash)
- CommonMark code fences/spans (backticks)
- HTML elements/attributes, but not inner text, e.g. <table><tr><td>@tag</td></tr></table> does contain a transformable TSDoc tag
- HTML comments
These are questionable:
- CommonMark ATX headings (# Heading)
- CommonMark links including image links
- CommonMark emphasis characters (e.g. **bold** or _italics_)
- autolinks (e.g. http://blarg/@tag)
These I'm proposing to NOT consider in the TSDoc stage (but a documentation tool's backend Markdown render is free to process them):
- CommonMark lists, blockquotes, breaks, etc.
- HTML escapes (e.g. &amp;, &lt;)
See https://github.com/microsoft/tsdoc/issues/70#issuecomment-578536347