
CDATA sections are being HTML-escaped

Open cabo opened this issue 2 years ago • 6 comments

kramdown seems to treat CDATA like normal text in XML (HTML) parts of a markdown input.

$ kramdown -o html   

<figure anchor="xml_happy3">
  <artwork align="left" name="" type="" alt=""><![CDATA[
+-----------------------+
| Use XML, be Happy :-) |
|_______________________|
     ]]></artwork>
</figure>

^D
<figure anchor="xml_happy3">
  <artwork align="left" name="" type="" alt="">&lt;![CDATA[
+-----------------------+
| Use XML, be Happy :-) |
|_______________________|
     ]]&gt;</artwork>
</figure>

Actually, I cannot find any CDATA processing on the input side of the kramdown parser.

cabo • Jun 08 '22 00:06

Yes, CDATA sections are currently not supported.

gettalong • Jun 08 '22 21:06

So what would it take to implement CDATA? CDATA sections are an alternative to normal text nodes, and the two can be mixed. They could also simply be resolved to text nodes, which means that the text data would then be escaped on output.
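
The "resolve to text nodes" option could look roughly like the Ruby sketch below. This is not kramdown code: `dissolve_cdata` and `CDATA_SECTION` are hypothetical names, and the sketch collapses parsing and output escaping into a single step for brevity.

```ruby
require "cgi"

# Matches one CDATA section; the non-greedy group stops at the first "]]>".
CDATA_SECTION = /<!\[CDATA\[(.*?)\]\]>/m

# Hypothetical helper (not part of kramdown): replace each CDATA section
# by its content, escaped the way an HTML writer would escape a text node.
def dissolve_cdata(source)
  source.gsub(CDATA_SECTION) { CGI.escapeHTML(Regexp.last_match(1)) }
end

puts dissolve_cdata("<artwork><![CDATA[| Use XML, be Happy :-) |]]></artwork>")
# prints: <artwork>| Use XML, be Happy :-) |</artwork>
```

With this approach the CDATA markers disappear, and only characters inside the section that actually need escaping (such as `<` or `&`) are escaped.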

cabo • Jun 09 '22 06:06

@cabo I'm not very familiar with CDATA sections. Are you saying

  1. that the content part of a CDATA section <![CDATA[content]]> can be treated just like content with the assumption that everything in content is just text (so no XML/HTML elements)?
  2. And that they can be mixed with text like Some element <![CDATA[some <xml> here]]> other text?

gettalong • Mar 20 '23 09:03

CDATA is just a way to avoid having to escape every XMLy character within a section of content.

You can treat CDATA sections as an extra node, like an XML parser would do, or you can dissolve the CDATA section into text, which is probably what makes more sense for the kramdown ecosystem. (In the latter case, you also don't have to process it in writers etc.)
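
The two options can be compared in a small Ruby sketch. The node shape (`[:text, value]` / `[:cdata, value]`) and the names `split_cdata` and `dissolve` are assumptions made for illustration, not kramdown's element model, and the sketch assumes the text outside CDATA sections contains no markup or entities of its own.

```ruby
CDATA_RE = /<!\[CDATA\[(.*?)\]\]>/m

# Option 1: keep CDATA sections as separate nodes, as an XML parser would.
# An unterminated "<![CDATA[" is simply left inside a :text node.
def split_cdata(input)
  nodes = []
  pos = 0
  while (m = CDATA_RE.match(input, pos))
    nodes << [:text, input[pos...m.begin(0)]] if m.begin(0) > pos
    nodes << [:cdata, m[1]]
    pos = m.end(0)
  end
  nodes << [:text, input[pos..-1]] if pos < input.length
  nodes
end

# Option 2: dissolve every node into one plain text node, so that later
# stages (writers etc.) never see a :cdata node.
def dissolve(nodes)
  [[:text, nodes.map { |_type, value| value }.join]]
end

nodes = split_cdata("Some element <![CDATA[some <xml> here]]> other text")
# nodes == [[:text, "Some element "], [:cdata, "some <xml> here"], [:text, " other text"]]
# dissolve(nodes) == [[:text, "Some element some <xml> here other text"]]
```

After dissolving, the usual output escaping of text nodes takes care of the `<` and `>` that came from the CDATA section.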

cabo • Mar 20 '23 11:03

@cabo I'm not very familiar with CDATA sections. Are you saying

  1. that the content part of a CDATA section <![CDATA[content]]> can be treated just like content with the assumption that everything in content is just text (so no XML/HTML elements)?

Yes.

  2. And that they can be mixed with text like Some element <![CDATA[some <xml> here]]> other text?

Yes. The "bug" for me is that the CDATA markup (<![CDATA[ and ]]>) stays in place and is even HTML-escaped. Instead, you should treat just the content of that section as (unparsed) text content, unless you want to treat the section specially as a CDATA node.

cabo • Mar 20 '23 11:03

ChatGPT says (slightly corrected by me):

CDATA stands for "character data" and is used in XML to enclose text that should be treated as raw character data, rather than markup.

In XML, markup symbols like '<' and '>' have special meanings and are used to define elements, attributes, and other structural components of the document. However, there may be cases when you want to include text that contains these symbols without them being interpreted as markup.

CDATA sections are a way to include such text in an XML document. They are enclosed within a pair of CDATA section markers that look like this: <![CDATA[ and ]]>. Any text within these markers is considered character data and is not parsed as XML markup.

For example, consider the following XML snippet:

<description>
   <![CDATA[
   <h2>Product Description</h2>
   <p>This is a <em>fantastic</em> product!</p>
   ]]>
</description>

In this example, the text within the CDATA section is <h2>Product Description</h2><p>This is a <em>fantastic</em> product!</p>. If this text were not enclosed in a CDATA section, it would be interpreted as markup[...]

CDATA sections can be used for any kind of character data containing special characters that could be misinterpreted as markup. Common use cases are embedding code snippets, scripts, or HTML content within an XML document.
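
To make the equivalence concrete, here is a minimal Ruby sketch showing that the entity-escaped spelling and the CDATA spelling carry the same character data. The helper `character_data` is hypothetical, and the sketch only handles a fragment that is either one whole CDATA section or ordinary escaped text.

```ruby
require "cgi"

raw = "<p>This is a <em>fantastic</em> product!</p>"

# Two spellings of the same character data.
escaped_form = CGI.escapeHTML(raw)   # every "<" and ">" written as an entity
cdata_form   = "<![CDATA[#{raw}]]>"  # valid only because raw contains no "]]>"

# Hypothetical helper: recover the character data from either spelling.
def character_data(fragment)
  if (m = /\A<!\[CDATA\[(.*?)\]\]>\z/m.match(fragment))
    m[1]                             # CDATA content is taken verbatim
  else
    CGI.unescapeHTML(fragment)       # ordinary text needs entity decoding
  end
end

puts character_data(escaped_form) == raw  # prints: true
puts character_data(cdata_form) == raw    # prints: true
```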

cabo • Mar 20 '23 11:03