feedvalidator

"&" sometimes recognized as markup within a CDATA section rather than character data

Open ScottG489 opened this issue 4 years ago • 0 comments

First, I'm assuming this is the code that powers the backend for https://validator.w3.org/feed/ and possibly https://www.rssboard.org/rss-validator/. However, the bug only occurs on the former website.

It seems that within certain elements that contain a CDATA section, if there is an ampersand (&) followed by a character that isn't a space, then the validator will report the following recommendation:

Invalid HTML: Named entity expected. Got none.

The message comes with a reference to this help doc.

The conditions that trigger this seem very specific: it does not reproduce for CDATA sections in every element. Here is a minimal example that reproduces the potential bug:

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Foo Bar</title>
    <link>https://example.com</link>
    <description><![CDATA[foo &bar]]></description>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <item>
      <description><![CDATA[foo &bar]]></description>
      <guid>http://example.com/123</guid>
    </item>
  </channel>
</rss>

The recommendation is reported on line 8 (not line 5), that is, for the <description> nested within <item> and not for the one directly under <channel>.

I tried the same CDATA section in a <title> nested within <item>, and it did not reproduce. I also tested a <description> nested directly within <channel>, and that didn't reproduce either. There may be other situations that trigger it, but so far I've only been able to reproduce it with the CDATA section inside a <description> nested within an <item>.
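For anyone who wants to repeat these checks, here is a minimal sketch that generates one feed per placement. The `build_feed` helper and the extra `<title>` inside `<item>` are my own additions for illustration and are not part of the validator; the script only confirms each variant is well-formed XML, and the output still has to be pasted into the validator by hand.

```python
import xml.etree.ElementTree as ET

CDATA = "<![CDATA[foo &bar]]>"  # payload that triggers the recommendation
PLAIN = "foo bar"               # harmless filler for the other slots

TARGETS = ("channel/description", "item/title", "item/description")


def build_feed(target: str) -> str:
    """Return the minimal feed with the CDATA payload only in `target`.

    Hypothetical helper: `target` must be one of TARGETS.
    """
    def body(slot: str) -> str:
        return CDATA if slot == target else PLAIN

    return f"""<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Foo Bar</title>
    <link>https://example.com</link>
    <description>{body('channel/description')}</description>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <item>
      <title>{body('item/title')}</title>
      <description>{body('item/description')}</description>
      <guid>http://example.com/123</guid>
    </item>
  </channel>
</rss>"""


variants = {target: build_feed(target) for target in TARGETS}

for target, xml_text in variants.items():
    ET.fromstring(xml_text)  # raises ParseError if the variant is not well-formed
    print(f"--- CDATA in {target} ---")
    print(xml_text)
```

Based on what I saw, only the `item/description` variant should produce the recommendation.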

This seems to indicate that, in this specific context, the validator is treating the "&" inside a CDATA section as markup rather than as character data. However, the XML specification's section on CDATA sections states:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.
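As a sanity check on the spec wording, here is a small sketch using Python's standard `xml.etree.ElementTree` on the feed from the example. It only shows what a conforming XML parser does with the CDATA section; it says nothing about what the validator does internally with the text afterwards.

```python
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Foo Bar</title>
    <link>https://example.com</link>
    <description><![CDATA[foo &bar]]></description>
    <atom:link href="https://example.com" rel="self" type="application/rss+xml"/>
    <item>
      <description><![CDATA[foo &bar]]></description>
      <guid>http://example.com/123</guid>
    </item>
  </channel>
</rss>"""

# Parsing succeeds, so the bare "&" inside CDATA is not a well-formedness error.
root = ET.fromstring(FEED)

channel_description = root.find("./channel/description").text
item_description = root.find("./channel/item/description").text

# Both descriptions come back as plain character data with a literal ampersand.
print(repr(channel_description))
print(repr(item_description))
```

At the XML level the two `<description>` elements are indistinguishable, so whatever produces the recommendation presumably happens in a later step that handles the item description differently.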

Looking forward to hearing your thoughts.

ScottG489 • Aug 06 '21 17:08