commonmark-spec
Allow optional namespace before tag name.
This is a proposed change to the markdown spec to allow namespaces within HTML tags.
It should be noted:
- Our tag name definition is a subset of allowable tag names as per https://www.w3.org/TR/xml-names/#ns-decl
- HTML5 disallows namespaces in tags, while XML/XHTML allow them.
- The attribute name definition in the spec already allows namespaces.
- The chance of impact to existing markdown documents is extremely low.
- Some parsers already allow namespaces in tags.
The motivation for this is to allow XML to pass through markdown documents unchanged. This can include things like SVG, MathML, etc. In my particular case, I have a post processing step which expands namespaced tags into templates.
An alternative to this proposal could be:
A [tag name](@) consists of an ASCII letter
followed by zero or more ASCII letters, digits,
hyphens (`-`) or colons (`:`).
That proposal is more similar to how attributes are defined, and easier to parse, but also not completely accurate w.r.t. the XMLNS specification.
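To make the alternative wording concrete, here is a minimal sketch of it as a regular expression. The constant and method names are made up for illustration and are not taken from any parser.

```ruby
# Hypothetical pattern for the alternative wording: an ASCII letter followed
# by zero or more ASCII letters, digits, hyphens, or colons.
TAG_NAME_WITH_COLONS = /\A[A-Za-z][A-Za-z0-9:-]*\z/

# Returns true when the whole string is a tag name under that wording.
def tag_name?(text)
  TAG_NAME_WITH_COLONS.match?(text)
end

puts tag_name?("svg:rect")               # true
puts tag_name?("content:youtube-video")  # true
puts tag_name?("a::b")                   # true, although it is not a valid XML QName
puts tag_name?("1abc")                   # false, must start with a letter
```

The `a::b` case shows where this simpler rule departs from the XML namespaces specification, which allows only a single colon between prefix and local part.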
See https://github.com/gjtorikian/commonmarker/pull/123 for a potential implementation.
cc @kivikakk
> - The attribute name definition in the spec already allows namespaces.
This especially makes me inclined to adjust the spec; at the very least we should be consistent, and I imagine it would be rare indeed that CommonMark users are including text like <a:b> in their documents and relying on it being (essentially) escaped for them, as currently happens.
@jgm Thoughts? This is a very small and low-impact change.
I've been trying to find some direct examples which might support this proposal.
In the current spec, there is a mention of DocBook. Well, it turns out the latest iteration of DocBook uses namespaces.
https://docbook.org/docs/howto/howto.html#introduction-ns
In addition, I can think of use cases combining things like Atom (XML syndication format) with Markdown to generate a feed. So, I believe there are sufficient use cases and existing standardisation to support this proposal.
> The chance of impact to existing markdown documents is extremely low.
As your babelmark example shows, CommonMark-conforming parsers treat <foo:bar> as an autolink, where foo: is the scheme and anything after it is the rest. I don’t think it would interfere much, since autolinks typically contain other characters that can’t appear in tag names, such as <mailto:[email protected]>, <tel:5551234567>, or <http://localhost>.
Good catch! Technically, it should be possible to disambiguate links from tags because tags should either be self-closing or have a matching closing tag.
That being said, I think it's an important consideration. Playing around with the issue you raise, it appears this issue exists even without namespaced tags: https://babelmark.github.io/?text=%3Csamuel%40oriontransfer.net%3E requires the parser to have some basic level of disambiguation (i.e. as defined by the spec).
Maybe additional disambiguation rules are required for <...> style links, e.g. playing with babel:
https://babelmark.github.io/?text=%3Cht%3A%2F%2Fwww.google.com%3E
vs
https://babelmark.github.io/?text=%3Ch%3A%2F%2Fwww.google.com%3E
Not sure what logic is being applied here.
The scheme must be 2 or more characters! https://spec.commonmark.org/0.29/#scheme
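A rough sketch of why the two babelmark inputs differ, assuming the 0.29 wording (a scheme is 2 to 32 characters, starting with an ASCII letter, followed by ASCII letters, digits, `+`, `.` or `-`). The constant and method names are my own, and this covers only URI autolinks, not email autolinks.

```ruby
# Scheme as worded in CommonMark 0.29: 2-32 characters, first one a letter.
SCHEME = /[A-Za-z][A-Za-z0-9+.-]{1,31}/

# URI autolink: "<", scheme, ":", then any characters other than whitespace,
# control characters, "<" and ">", then ">".
URI_AUTOLINK = /\A<#{SCHEME}:[^[:space:][:cntrl:]<>]*>\z/

def uri_autolink?(text)
  URI_AUTOLINK.match?(text)
end

puts uri_autolink?("<ht://www.google.com>")  # true, two-character scheme
puts uri_autolink?("<h://www.google.com>")   # false, scheme is too short
puts uri_autolink?("<m:abc>")                # false, for the same reason
```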
> Text between < and > that looks like an HTML tag is parsed as a raw HTML tag and will be rendered in HTML without escaping. Tag and attribute names are not limited to current HTML tags, so custom tags (and even, say, DocBook tags) may be used.
Do you think this takes precedence over autolinks?
Maybe one solution, as we have a list of block level HTML tags, is to have a list of recognised schemes for <...> style links. This would fix any ambiguities caused by this syntax. It can be extensive but pre-determined.
I would personally prefer not to add another list of supported values to CM, as it increases the minimum memory/size footprint of all conforming markdown parsers. As for precedence between HTML and autolinks, I think it’s currently undefined in the spec. I believe no current autolink is also a valid HTML tag, and vice versa. This would change that.
For DocBook: having a namespace (such as given with xmlns) doesn’t necessarily require prefixes (such as in tag names, svg:rect, or attributes, xml:lang). HTML, SVG, and MathML don’t need prefixed tag names either (attributes only). Could you expand on this part you mentioned above?
> I have a post processing step which expands namespaced tags into templates.
I would personally rather remove constructs that aren’t used in HTML anymore, such as CDATA and processing instructions, than start supporting more XML.
Generally I'm very sympathetic to the request; however, there's the issue of autolinks.
See https://talk.commonmark.org/t/what-is-the-point-of-limiting-uri-schemes-in-autolinks/555/13 for an earlier discussion of this point (when we simplified the definition of absolute URI by not hard-coding a list of schemes, which we used to do). At that point I raised the issue of namespaces in XML tags. It was observed that one-character namespaces would still not be interpreted as URIs. I don't know if having one-character namespaces is enough for people's purposes.
Thanks for everyone's feedback.
> I don't know if having one-character namespaces is enough for people's purposes.
My tag structure needs to use more elaborate namespaces, e.g.
<content:youtube-video id="..." />
<gallery:photos path="..." />
The namespaces are "mounted" by the rendering engine and cause tag expansion before generating the final HTML. So naturally, I want to write some markdown like this:
## Introduction Video
<content:youtube-video id="..." />
I'm okay with the following, completely unambiguous forms:
<namespace:tagname />
<namespace:tagname attr="value"/>
and
<namespace:tagname>...</namespace:tagname>
However to disambiguate the latter, the block parser would need to scan ahead and the inline parser would need to look for a matching closing tag.
The need for a space on the self closing form is a bit of a hack, but acceptable, because in most cases, if you provide attributes, it's not needed.
One other option is we could expect users to provide a list of namespaces which could be available as metadata:
e.g.
[content]: x-internal-content
[gallery]: x-internal-gallery
[math]: http://www.w3.org/1998/Math/MathML
This would completely disambiguate the parse.
@jgm can I get your feedback on the above proposal?
The spec has declarative style: it says, "such and such is an X." Not: first, try parsing as an X, and if that doesn't work, try as a Y. So the ambiguity of <foo:bar> between a tag and an autolink interpretation is a problem, even if you personally would be content if <foo:bar> always resolved as an autolink.
So I don't yet see a proposal that would work with the current spec. We could easily modify things to allow one-character namespaces, but you don't seem satisfied with that.
Making parsing parameterized on a list of namespaces provided by the user would also break the style of the spec -- currently it's self-contained and doesn't depend on externally provided lists -- unless the list is defined in the document itself. I take it that's what you were gesturing at with
[content]: x-internal-content
[gallery]: x-internal-gallery
[math]: http://www.w3.org/1998/Math/MathML
but of course this already has a clearly defined meaning in commonmark (reference link definitions).
I wonder whether your needs might be met in other ways? For example,
<? content:youtube-video id="..." ?>
<? gallery:photos path="..." ?>
are already parsed as raw HTML.
> allow one-character namespaces, but you don't seem satisfied with that.

It means one standard way for everything else and a different way for Markdown. I cannot copy code from an existing document and insert it into Markdown without rewriting the namespaces.
> The spec has declarative style: it says, "such and such is an X." Not: first, try parsing as an X, and if that doesn't work, try as a Y. So the ambiguity of <foo:bar> between a tag and an autolink interpretation is a problem, even if you personally would be content if <foo:bar> always resolved as an autolink.
Fair enough, it makes sense. I assume you mean that the parser should not need to backtrack?
I was under the (maybe wrong) impression that auto links already have to deal with this, i.e. if there is whitespace, it cannot be an auto-link. As I suggested, I'd be happy with this, i.e. <ns:name /> is an acceptable tradeoff.
> but of course this already has a clearly defined meaning in commonmark (reference link definitions).

Yes, namespaces are a kind of link, so I think this usage would be reasonable. Yes, it would need to be self-contained and precede the usage IMHO. Maybe a different syntax would be okay, e.g. following the xmlns: style:
[xmlns:math]: http://www.w3.org/1998/Math/MathML
Generators that understand XML could use these links when generating the output document.
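As a rough idea of how a generator might pick these up, here is a minimal sketch that collects declarations written in the `[xmlns:prefix]: uri` form. That syntax is only the suggestion from this comment, not anything in the spec, and the names `XMLNS_DECLARATION` and `collect_namespaces` are invented for the example.

```ruby
# Hypothetical declaration syntax from this thread: "[xmlns:prefix]: uri".
XMLNS_DECLARATION = /\A\[xmlns:([A-Za-z][A-Za-z0-9-]*)\]:\s+(\S+)\s*\z/

# Collects prefix => URI pairs from the lines of a document.
def collect_namespaces(text)
  text.each_line.with_object({}) do |line, namespaces|
    match = XMLNS_DECLARATION.match(line.chomp)
    namespaces[match[1]] = match[2] if match
  end
end

DOCUMENT = <<~MARKDOWN
  [xmlns:math]: http://www.w3.org/1998/Math/MathML
  [xmlns:content]: x-internal-content

  <content:youtube-video id="..." />
MARKDOWN

p collect_namespaces(DOCUMENT)
```

A real implementation would more likely read these from the parser's existing reference link table than rescan the source text.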
Nested HTML
One solution which appears to work is something like this:
<div>
<hello:world/>
</div>
However I would like to avoid inserting additional divs. It also doesn't work for inline elements:
Some inline tags <span><hello:world /></span>
Taking your example, why not allow something like this:
<?xml version="1.0" encoding="UTF-8" ?>
<hello:world />
or more specifically
<?my-processing-engine ?>
<hello:world />
It should extract one entire HTML block.
I don't know how you could do it for inline elements.
I think using <? ?> for tags is both wrong and liable to introduce more incompatibilities in the future, given that PI have a specific purpose.
Summary
Here are the possible options:
- Do nothing. Some compatibility (maybe confusing) when nested in HTML.
- Introduce PI style tags, maybe problem with existing PI tags.
- Limit auto-links to specific schemes, to avoid clash with XMLNS tags.
- Explicitly list namespaces using links (or something similar) so that parsing can be disambiguated.
- Introduce invalid auto-link characters for disambiguation, e.g. whitespace, `=`, etc.
- Change baseline assumptions to HTML5 and drop support for any unsupported constructs, PI, CDATA, namespaces, etc, maybe break backwards compatibility.
For my 2c, here is what I think:
- I agree with removing all non-HTML5 features generally, as a direction, but this will potentially break both backwards and forwards compatibility. HTML5 is also a pretty massive compromise when it comes to actual structured documents. It depends on the direction of Markdown. In my experience, HTML5 has a lot of exceptions that make it hard to parse. XML is a lot easier to parse/validate and is far more predictable/standardised.
- I think limiting auto-links to known schemes makes sense, as it seems too ambiguous to me, and maybe a security issue (a whitelist is definitely a good idea). i.e. should we be allowing things like <javascript:console.log%28%22Hello%20World%22%29>?
- Extending links for the purpose of namespaces seems pretty natural to me, but might cause incompatibilities unless the syntax was different, e.g. [xmlns:content]: x-internal-content, etc.
Is Markdown an exclusively HTML output format, or do we want to support other kinds of mappings? i.e. should we be able to generate other formats (DocBook, SVG, MathML, etc). Because these formats have provisions for and practically speaking include XMLNS as part of their specification. So either we go all in with "HTML5" as the baseline and reject the above formats, or we try to figure out how to be compliant with XML/XMLNS so that the above formats fit naturally and without "markdown specific" adaptations (i.e. using PI "tags").
I reviewed in my own code what would be required to use 1-character namespaces.
Firstly, we'd need some way in the markdown document to hook this up, e.g.
[xmlns:c]: x-utopia-content
<c:youtube-video id="..."/>
Otherwise there is no way to attach the logic without making assumptions.
Internally, the code would look like:
def process_tag(namespace, tagname)
  # Expand the tag only if its prefix is bound to the expected namespace.
  if document.links[namespace] == 'x-utopia-content'
    render_tag(tagname, ...)
  end
end
So we would need to expose enough bits (e.g. xmlns:c link) in order to make sense of the syntax being used without hard coding it.
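For what it's worth, a self-contained version of that idea might look like the sketch below. Everything here is a stand-in: `NAMESPACES` plays the role of the document's link table, `HANDLERS` plays the role of `render_tag`, and the generated markup is invented for the example.

```ruby
# Hypothetical stand-in for the document's link table (prefix => namespace).
NAMESPACES = { "c" => "x-utopia-content" }.freeze

# Hypothetical renderers keyed by namespace; no HTML escaping is done here.
HANDLERS = {
  "x-utopia-content" => ->(tagname, attributes) {
    %(<div class="#{tagname}" data-id="#{attributes["id"]}"></div>)
  }
}.freeze

# Expands a namespaced tag, or returns nil when the prefix is not bound.
def process_tag(namespace, tagname, attributes = {})
  handler = HANDLERS[NAMESPACES[namespace]]
  handler&.call(tagname, attributes)
end

puts process_tag("c", "youtube-video", { "id" => "abc123" })
p process_tag("unknown", "youtube-video")
```

Looking the namespace up through the prefix, rather than matching on the prefix itself, is what avoids hard coding `c` in the renderer.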
I've made a fork to experiment with the changes and since this is blocking downstream work, I'll use this fork in my own projects so I can gain some experience with the potential changes and report back.
https://github.com/ioquatix/markly
Only two specs failed, and they both relate to the usage of single character namespaces. So either the current spec is under-specified, my implementation is not working as expected, or the changes (allowing : in a tag name) are not in conflict with any of the current examples in the spec.
@ioquatix Could you solve this by doing the inverse? Right now, you have XML embedded in Markdown. What if you have a proper XML document, parse it, and then treat some elements inside it as Markdown?
This all depends on why you ran into this problem. What you’re doing.
(depending on what you’re trying to do, you may also be able to use the xmlns attribute instead of a prefix in the tag name, no?)
> I agree removing all non-HTML5 features generally, as a direction, but this will potentially break both backwards and forwards compatibility.
True! I feel the spec is currently pulled into a split between XML and HTML. In some cases, XML things are supported and not all HTML is, but in others, HTML semantics are applied and not all XML is supported. I think it would make sense to go in a clear direction. And the reason I prefer HTML is that it is, similar to Markdown, a format made for authors, without errors, whereas XML is strict and has errors.
> I think limiting auto-links to known schemes makes sense, as it seems too ambiguous to me, and maybe a security issue (a whitelist is definitely a good idea). i.e. should we be allowing things like <javascript:console.log%28%22Hello%20World%22%29>?
Yes, all of this is allowed, though many implementations (including the reference implementations) are "safe by default" and only pass this through as a link if you specify the "unsafe" option. Forbidding a scheme in autolinks is pretty weak security -- you could always use it in a regular link. Forbidding a scheme in any links is also pretty weak security -- if you're worried about security, there are many other things you need to consider as well.
> I was under the (maybe wrong) impression that auto links already have to deal with this, i.e. if there is whitespace, it cannot be an auto-link. As I suggested, I'd be happy with this, i.e. <ns:name /> is an acceptable tradeoff.
It's not about parsing, it's about specifying. The spec says: such-and-such counts as inline HTML. So we have an ambiguity problem if the same string also counts as an autolink. We could solve that by specifying a precedence explicitly in the spec (ugly, as it goes against the declarative style of the spec), or by making the spec for inline HTML more complex so that <foo:bar> doesn't count but <foo:bar /> does (adds complexity and makes the spec harder to understand). It's not that this couldn't be done, but there's a cost to it.
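To illustrate the overlap described above, here is a small sketch that checks a string against two simplified definitions at once. Both patterns are assumptions made for the example: the tag pattern ignores attributes and closing tags, and the autolink pattern follows the 0.29 scheme wording but skips email autolinks.

```ruby
# Simplified, hypothetical patterns; neither is the spec's full grammar.
# Open tag whose name may contain colons, with optional whitespace and "/".
NAMESPACED_TAG = /\A<[A-Za-z][A-Za-z0-9:-]*\s*\/?>\z/
# URI autolink with a scheme of 2-32 characters, per the 0.29 definition.
AUTOLINK = /\A<[A-Za-z][A-Za-z0-9+.-]{1,31}:[^[:space:][:cntrl:]<>]*>\z/

# Lists every definition the string would satisfy.
def interpretations(text)
  result = []
  result << :html_tag if NAMESPACED_TAG.match?(text)
  result << :autolink if AUTOLINK.match?(text)
  result
end

["<foo:bar>", "<foo:bar/>", "<foo:bar />", "<m:abc>", "<http://localhost>"].each do |sample|
  puts "#{sample} => #{interpretations(sample).inspect}"
end
```

Under these assumptions `<foo:bar>` and `<foo:bar/>` satisfy both definitions, while `<foo:bar />` and `<m:abc>` only look like tags and `<http://localhost>` only looks like an autolink.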
I agree with all your points, but I also have the same problem I started with. So, let's explore with the fork for now. It seems that making the single change didn't break a single spec unexpectedly, so at least we are in a situation where we should add more specs to break my fork, or add more specs to support its existence.
There is one spec that fails somewhat expectedly:
1) Failure:
TestSpec#test_html_renderer_example_617 [/home/samuel/Documents/ioquatix/markly/test/test_spec.rb:20]:
<m:abc>
.
Expected: "<p>&lt;m:abc&gt;</p>"
Actual: "<m:abc>"
But this seems like the specific specification of single character namespaces as outlined above. Is it currently explicitly denied? Was that the intention?
So far in my fork I have not encountered any unexpected behaviour, and I feel as if it's working as I'd expect w.r.t. namespaced tags.
@ioquatix Do you need namespaces to work for HTML blocks or HTML inlines (or both)?
If HTML blocks are enough, I think the ambiguity problem goes away. Block parsing happens before inline parsing, so if the block is parsed as an HTML block, no inline parsing will happen and content in the block cannot be confused with autolinks.
It would mean that the definition of an open tag is different for HTML blocks and inline HTML, but maybe that's acceptable.
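A sketch of what a block-only rule could look like: a line counts as the start of an HTML block when it consists solely of one complete open tag whose name may contain colons. The attribute grammar is simplified from the spec's wording, the names are invented, and the other conditions the spec places on HTML blocks are ignored.

```ruby
# Simplified attribute grammar: whitespace, a name, and an optional value.
ATTRIBUTE = /\s+[A-Za-z_:][A-Za-z0-9_.:-]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))?/

# A line holding only one complete open tag, with colons allowed in the name.
BLOCK_OPEN_TAG = /\A {0,3}<[A-Za-z][A-Za-z0-9:-]*(?:#{ATTRIBUTE})*\s*\/?>\s*\z/

# True when the line would start an HTML block under this hypothetical rule,
# in which case its content would never reach the inline (autolink) parser.
def html_block_start?(line)
  BLOCK_OPEN_TAG.match?(line)
end

puts html_block_start?('<content:youtube-video id="..." />')           # true
puts html_block_start?('Some inline tags <span><hello:world /></span>') # false
puts html_block_start?('<http://localhost>')                            # false
```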
I guess I want both. But I understand where you are coming from and maybe it's a good first step. That being said, my modified parser doesn't seem to have any issues thus far with "real world" markdown.
I am still using this in production on an admittedly small dataset and haven't run into any issues. I think limiting auto-linking to a well-known (perhaps configurable per parser) subset of schemes is a logical approach.
The only change required was adding : to the allowed tag name parser. Based on the implementation of cmark-gfm, it did not cause any issues with the auto-link detection.
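For reference, limiting autolinks to a known, configurable set of schemes could be sketched roughly as follows. The default list is purely illustrative (the thread never settles on one), and the names are made up for the example.

```ruby
# Hypothetical allowlist; the thread does not settle on a concrete list.
ALLOWED_SCHEMES = %w[http https ftp mailto tel].freeze

# URI autolink shape per the 0.29 wording, capturing the scheme.
AUTOLINK_PARTS = /\A<([A-Za-z][A-Za-z0-9+.-]{1,31}):[^[:space:][:cntrl:]<>]*>\z/

# True only when the text is autolink-shaped and its scheme is allowed.
def allowed_autolink?(text, schemes = ALLOWED_SCHEMES)
  match = AUTOLINK_PARTS.match(text)
  !match.nil? && schemes.include?(match[1].downcase)
end

puts allowed_autolink?("<http://localhost>")                   # true
puts allowed_autolink?("<content:youtube-video>")              # false
puts allowed_autolink?("<content:youtube-video>", ["content"]) # true
```

With a rule like this, `<content:youtube-video>` would no longer be claimed by the autolink definition unless `content` were explicitly listed as a scheme.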