micromark-extension-directive
Non-ASCII directive names
Initial checklist
- [X] I read the support docs
- [X] I read the contributing guide
- [X] I agree to follow the code of conduct
- [X] I searched issues and couldn’t find anything (or linked relevant results below)
Problem
I write text files using an extended markdown syntax with a flavour for specific needs. Those text files are not in Latin script. I want to keep them in a uniform language without formatting prompts in English.
Markdown in general appears to have a language-independent syntax. ASCII-limited directives bring language-dependence.
Specific example
I am a Ukrainian speaker, creating a project for the local community with no internationalisation need in the future. I want to keep files in my native language as much as possible and have syntax as simple as possible.
My text files are songs. Sometimes, they contain a chorus that repeats after each verse (paragraph). Take a timely example:
Dashing through the snow
In a one-horse open sleigh
O'er the fields we go
Laughing all the way
Bells on bob tail [sic] ring
Making spirits bright
What fun it is to ride and sing
A sleighing song tonight! Oh!
:::chorus
Jingle bells, jingle bells,
Jingle all the way.
Oh! what fun it is to ride
In a one-horse open sleigh. Hey!
Jingle bells, jingle bells,
Jingle all the way;
Oh! what fun it is to ride
In a one-horse open sleigh.
:::
A day or two ago
I thought I'd take a ride
And soon, Miss Fanny Bright
Was seated by my side,
The horse was lean and lank
Misfortune seemed his lot
He got into a drifted bank
And then we got upsot.
A day or two ago,
The story I must tell
I went out on the snow,
And on my back I fell;
A gent was riding by
In a one-horse open sleigh,
He laughed as there I sprawling lie,
But quickly drove away. Ah!
Now the ground is white
Go it while you're young,
Take the girls tonight
and sing this sleighing song;
Just get a bobtailed bay
Two forty as his speed
Hitch him to an open sleigh
And crack! you'll take the lead.
My custom script detects the chorus and repeats it after each paragraph. However, “chorus” in Ukrainian is `приспів`, and I would love to keep that native word in a Ukrainian text.
Solution
Configurable naming limitations.
Alternatives
- Find and replace before directive parsing.
- A forked parser with a patch
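To make the first alternative concrete, here is a minimal sketch of a find-and-replace pass that runs before directive parsing. The `aliases` table and the `translateDirectiveNames` function are hypothetical names made up for illustration, and the sketch assumes only container directives (`:::name`) need translating.

```javascript
// Hypothetical alias table: native-language name -> ASCII name the parser accepts.
const aliases = new Map([['приспів', 'chorus']])

// Rewrite the name that follows a run of three or more colons at the start of a line.
// Assumption: only container directives (`:::name`) are translated; a bare closing
// `:::` has no name, so it is left untouched.
function translateDirectiveNames(markdown) {
  return markdown.replace(
    /^(:{3,})([^\s:{\[]+)/gmu,
    (match, fence, name) => (aliases.has(name) ? fence + aliases.get(name) : match)
  )
}

console.log(translateDirectiveNames(':::приспів\nJingle bells, jingle bells,\n:::'))
```

A plain text replacement like this does not know about code blocks or escapes, so it would also rewrite a matching line inside fenced code. That is one reason it stays a workaround rather than a fix.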
Here I shared my specific problem. I don't object to the current implementation with the imposed limitations backed up with solid reasoning in the readme about spacing and trailing colons.
I would love to understand the rationale behind limiting the directive naming.
@wooorm may be able to offer more context. From reviewing the description/spec https://talk.commonmark.org/t/generic-directives-plugins-syntax/444 I believe the intent is to be roughly compatible with html/custom element naming conventions https://html.spec.whatwg.org/multipage/custom-elements.html#valid-custom-element-name https://developer.mozilla.org/en-US/docs/Web/API/CustomElementRegistry/define#valid_custom_element_names which require the sequence start with an ASCII character (the difference being that directives do not require a dash).
The reason the current state is the way it is, is so that I didn’t have to decide.
Custom elements look like a good thing to be compatible with. Although I don’t think a) the `-`, b) the disallowed uppercase, c) the disallow list such as `font-face` and such need to be enforced. That is to say: it’s not bad if we allow some names that aren’t strictly compatible with HTML custom elements.
I wonder whether we need to enforce the disallowed ASCII punctuation/symbols though. I can see `$` being useful, as it’s in JS too. Putting say `(` or `'` or `/` or `;` in there seems weird. Although, as HTML allows many of those characters in attribute names, perhaps we can allow them too? Otherwise we should have different handling for “tag” names and “attribute” names.
Maybe simplest is to allow all unicode characters that are not unicode whitespace? https://github.com/micromark/micromark/blob/929275e2ccdfc8fd54adb1e1da611020600cc951/packages/micromark-util-character/dev/index.js#L232
@wooorm and @ChristianMurphy thank you for sharing your details. I also have assumed ~custom-elements~ (rather) HTML elements naming convention but I wanted to clarify this. If this is not a strict requirement, I would appreciate a change.
Thinking of a potential solution, character ranges listed in the HTML standard for custom element names seem to be reasonable to me. The PCENChar (potential custom element name character) is quite wide; it seems to allow all "alphabets", including characters needed in my case.
PCENChar ::=
"-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
Yet, it is beyond the proposed simplest solution and still enforces some limits. What do you think?
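For a quick check without printing every character, the PCENChar ranges quoted above can be transcribed into a single regular expression. This is only a sketch: `pcenChars` and `usesOnlyPcenChars` are names invented here, and the check covers the character set only, not the full custom element name grammar (which also requires an initial `[a-z]` and a hyphen).

```javascript
// PCENChar ranges transcribed from the grammar quoted above.
// The `u` flag is needed for the `\u{...}` escapes and the astral range.
const pcenChars =
  /^[-.0-9_a-z\u{B7}\u{C0}-\u{D6}\u{D8}-\u{F6}\u{F8}-\u{37D}\u{37F}-\u{1FFF}\u{200C}-\u{200D}\u{203F}-\u{2040}\u{2070}-\u{218F}\u{2C00}-\u{2FEF}\u{3001}-\u{D7FF}\u{F900}-\u{FDCF}\u{FDF0}-\u{FFFD}\u{10000}-\u{EFFFF}]+$/u

// Hypothetical helper: true when every character of `name` is a PCENChar.
function usesOnlyPcenChars(name) {
  return pcenChars.test(name)
}

console.log(usesOnlyPcenChars('приспів')) // Cyrillic falls inside #x37F-#x1FFF
console.log(usesOnlyPcenChars('Chorus')) // uppercase ASCII is not a PCENChar
```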
Script I used to preview ranges
I am not knowledgeable in the Unicode char ranges, so I asked ChatGPT what range numbers mean (extended Latin, Japanese, Greek, Cyrillic etc) and reviewed the list manually using a script.
// "-"
// "."
// [0-9]
// "_"
// [a-z]
// Collect the non-ASCII PCENChar ranges for printing.
const chars = []
chars.push(String.fromCharCode(0xB7))
for (let i = 0xC0; i <= 0xD6; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xD8; i <= 0xF6; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xF8; i <= 0x37D; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x37F; i <= 0x1FFF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x200C; i <= 0x200D; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x203F; i <= 0x2040; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x2070; i <= 0x218F; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x2C00; i <= 0x2FEF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0x3001; i <= 0xD7FF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xF900; i <= 0xFDCF; ++i) chars.push(String.fromCharCode(i))
for (let i = 0xFDF0; i <= 0xFFFD; ++i) chars.push(String.fromCharCode(i))
// Code points above 0xFFFF need fromCodePoint; fromCharCode truncates them to 16 bits.
for (let i = 0x10000; i <= 0xEFFFF; ++i) chars.push(String.fromCodePoint(i))
console.log(chars.join('\n'))
Some more considerations:
- Allowing most custom element names is indeed nice, but it’s not a goal to only support custom element names. Directives are not only useful for HTML. One existing example is that Docusaurus treats them as an alternative to JSX. Meaning the names should also be able to match (most) JS identifiers.
- In HTML, tag names and attribute names can match basically anything, because these names can only occur in special places. The `<` and whitespace and `=` and `/` and `>` are very strong indicators of where the parser is. In markdown, this is more complex. Is `:a*b*` an `a` directive followed by emphasis or an `a*b*` directive? Is `:a$b$` an `a$b$` directive, or is `b` math, when enabled?
Custom elements allow basically all higher-than-ascii punctuation, and in the ASCII range `-`, `.`, `_`.
JavaScript identifiers do not allow most punctuation, but allow `$` and `_` in the ASCII range.
In markdown, all ASCII punctuation either already is something in CM (`_`) or could be something (such as `$` for math).
So I’d prefer starting with few ASCII punctuation characters; we can expand later:
- Disallow all whitespace/controls
- Disallow ascii punctuation, except allow `.`, `-`, `_`
- Allow the rest (basically alphanumerical and higher-than-ascii punctuation)
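The three rules in this list can be sketched as a standalone predicate. This is one reading of the list, not micromark's code: `isNameCode` and `isDirectiveName` are hypothetical names, and “whitespace/controls” is assumed to mean anything matched by `\s` or the Unicode `Cc` category.

```javascript
// Hypothetical predicate for a single code point, following the three rules above.
function isNameCode(code) {
  const char = String.fromCodePoint(code)
  // Rule 1: no whitespace or control characters (assumed to be \s plus category Cc).
  if (/[\s\p{Cc}]/u.test(char)) return false
  // Rule 2: in the ASCII range, only alphanumerics plus `.`, `-`, `_`.
  if (code < 0x80) return /[0-9A-Za-z._-]/.test(char)
  // Rule 3: everything else is allowed, including non-ASCII punctuation.
  return true
}

// Hypothetical helper: a non-empty name made only of allowed code points.
function isDirectiveName(name) {
  return name.length > 0 && [...name].every((c) => isNameCode(c.codePointAt(0)))
}

console.log(isDirectiveName('приспів'), isDirectiveName('a$b'))
```

Spreading the string with `[...name]` iterates by code point, so characters outside the Basic Multilingual Plane are checked whole instead of as two surrogate halves.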
“basically alphanumerical and higher-than-ascii punctuation”
@wooorm do you have `\w` in mind or anything else?
I have found that `/[\p{L}\p{N}][\p{L}\p{N}._-]*/u` might work just fine, where `\p{N}` is a Unicode number, and `\p{L}` is a Unicode letter (docs, look for # General_Category).
This may be expanded to:
export const unicodeAlphanumeric = regexCheck(/[\p{L}\p{N}]/u)
If we come to an agreement, I could prepare a pull request. What do you think?
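A small sketch to try a Unicode-property pattern of this kind against a few names. The `namePattern` constant is a name made up here, and the anchors `^` and `$` are added so the whole name is tested. One detail worth noting: inside a character class, `.-_` is read as a range from `.` to `_`, so the hyphen is placed last to keep it literal.

```javascript
// Letters or numbers first, then letters, numbers, `.`, `_` or `-`.
// The hyphen sits at the end of the class so it is not parsed as a range operator.
const namePattern = /^[\p{L}\p{N}][\p{L}\p{N}._-]*$/u

for (const name of ['приспів', 'chorus-2', 'a;b', '-a']) {
  console.log(name, namePattern.test(name))
}

// For comparison, `[.-_]` is a range (U+002E to U+005F): it matches `;`
// but not `-` itself (U+002D).
console.log(/[.-_]/u.test(';'), /[.-_]/u.test('-'))
```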
We already have the parts in micromark. I think this is fine:
const fine = code <= codes.del
  ? code === codes.dash ||
    code === codes.dot ||
    code === codes.underscore ||
    asciiAlphanumeric(code)
  : classifyCharacter(code) !== constants.characterGroupWhitespace
Using `asciiAlphanumeric` from `micromark-util-character`, `classifyCharacter` from `micromark-util-classify-character`, and `codes` and `constants` from `micromark-util-symbol`!
Note I think similar rules need to be applied to attribute names. They are a bit more complex because say `.a.b` is already a shortcut for two classes.
Attributes are also prohibited from starting with an ASCII number (they’re currently only accepting ASCII too). I wonder if that’s needed.