kramdown
CDATA sections are being HTML-escaped
kramdown seems to treat CDATA like normal text in XML (HTML) parts of a markdown input.
$ kramdown -o html
<figure anchor="xml_happy3">
<artwork align="left" name="" type="" alt=""><![CDATA[
+-----------------------+
| Use XML, be Happy :-) |
|_______________________|
]]></artwork>
</figure>
^D
<figure anchor="xml_happy3">
<artwork align="left" name="" type="" alt=""><![CDATA[
+-----------------------+
| Use XML, be Happy :-) |
|_______________________|
]]></artwork>
</figure>
Actually, I cannot find any CDATA processing on the input side of the kramdown parser.
Yes, CDATA sections are currently not supported.
So what would it take to implement CDATA? CDATA sections are an alternative to normal text nodes, and the two can be mixed. They could also simply be resolved to text nodes, which means that the text data would then be escaped on output.
@cabo I'm not very familiar with CDATA sections. Are you saying
- that the content part of a CDATA section <![CDATA[content]]> can be treated just like content, with the assumption that everything in content is just text (so no XML/HTML elements)?
- And that they can be mixed with text like
Some element <![CDATA[some <xml> here]]> other text
?
CDATA is just a way to avoid having to escape every XMLy character within a section of content.
You can treat CDATA sections as an extra node type, like an XML parser would do, or you can dissolve the CDATA section into text, which probably makes more sense for the kramdown ecosystem. (In the latter case, you also don't have to handle it in the writers etc.)
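The two options can be sketched in plain Ruby. This is only an illustration, not kramdown code: `split_cdata`, `dissolve`, and `CDATA_RE` are hypothetical names, and labelling everything outside a CDATA section as `:text` is a simplification (a real parser would still parse that part for markup).

```ruby
# Hypothetical helpers, not part of kramdown.
# Matches one CDATA section and captures its content.
CDATA_RE = /<!\[CDATA\[(.*?)\]\]>/m

# Option 1: keep CDATA as its own node kind.
# Returns an array of [:text, string] and [:cdata, string] segments.
def split_cdata(input)
  segments = []
  pos = 0
  input.scan(CDATA_RE) do
    match = Regexp.last_match
    # Input between the previous CDATA section and this one, if any.
    segments << [:text, input[pos...match.begin(0)]] if match.begin(0) > pos
    segments << [:cdata, match[1]]
    pos = match.end(0)
  end
  # Trailing input after the last CDATA section.
  segments << [:text, input[pos..-1]] if pos < input.length
  segments
end

# Option 2: dissolve CDATA segments into one unescaped text value,
# leaving escaping to whatever writes the output.
def dissolve(segments)
  segments.map { |_kind, str| str }.join
end

mixed = "Some element <![CDATA[some <xml> here]]> other text"
p split_cdata(mixed)
# => [[:text, "Some element "], [:cdata, "some <xml> here"], [:text, " other text"]]
p dissolve(split_cdata(mixed))
# => "Some element some <xml> here other text"
```

With the second option, downstream code only ever sees text, so the CDATA markers never reach a converter.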
@cabo I'm not very familiar with CDATA sections. Are you saying
- that the content part of a CDATA section <![CDATA[content]]> can be treated just like content, with the assumption that everything in content is just text (so no XML/HTML elements)?
Yes.
- And that they can be mixed with text like
Some element <![CDATA[some <xml> here]]> other text
?
Yes. The "bug" for me is that the CDATA markup (<![CDATA[ and ]]>) stays in place and is even HTML-escaped.
Instead you should treat just the content of that section as (unparsed) text content.
(If you don't want to treat them specially as a CDATA node.)
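As a rough sketch of that behavior for HTML output, assuming the content is simply escaped when written: `escape_html` and `resolve_cdata` are hypothetical helpers in plain Ruby, not kramdown APIs, and the `<artwork>` input is a made-up example.

```ruby
# Hypothetical helpers, not part of kramdown.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape the characters that would otherwise be read as HTML markup.
def escape_html(text)
  text.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Drop the CDATA markers and emit only the escaped content;
# everything outside a CDATA section is passed through unchanged.
def resolve_cdata(input)
  input.gsub(/<!\[CDATA\[(.*?)\]\]>/m) { escape_html(Regexp.last_match(1)) }
end

puts resolve_cdata("<artwork><![CDATA[a < b && c]]></artwork>")
# prints: <artwork>a &lt; b &amp;&amp; c</artwork>
```

The markers themselves disappear; only characters inside the section that are special in HTML get escaped.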
ChatGPT says (slightly corrected by me):
CDATA stands for "character data" and is used in XML to enclose text that should be treated as raw character data, rather than markup.
In XML, markup symbols like '<' and '>' have special meanings and are used to define elements, attributes, and other structural components of the document. However, there may be cases when you want to include text that contains these symbols without them being interpreted as markup.
CDATA sections are a way to include such text in an XML document. They are enclosed within a pair of CDATA section markers that look like this: <![CDATA[ and ]]>. Any text within these markers is considered character data and is not parsed as XML markup.
For example, consider the following XML snippet:
<description>
<![CDATA[
<h2>Product Description</h2>
<p>This is a <em>fantastic</em> product!</p>
]]>
</description>
In this example, the text within the CDATA section is <h2>Product Description</h2><p>This is a <em>fantastic</em> product!</p>. If this text were not enclosed in a CDATA section, it would be interpreted as markup[...]
CDATA sections can be used for any kind of character data that may contain special characters that could be misinterpreted as markup. Common use cases are embedding code snippets or scripts, or HTML content, within an XML document.