RSParser RSS feed broken if CDATA contains lower ascii characters

Hi there,

I just stumbled upon a feed that uses chars in the range \0x01 - \0x1F (CDATA description). Although libxml2 isn't supposed to handle this, RSParser will break early and drop the remaining feed articles. When parsing the RSS below, only the first two items will be returned.

Replacing these with a regex should be enough. However, I was wondering if there is a libxml2 flag that could be used instead…

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
<channel>
	<title>Feed Title</title>
<item>
		<title>1</title>
		<link>http://someurl.com/1/</link>
		<description><![CDATA[Description of first]]></description>
</item>
<item>
		<title>2</title>
		<link>http://someurl.com/2/</link>
		<description><![CDATA[Description with  \0x04 values]]></description>
</item>
<item>
		<title>3</title>
		<link>http://someurl.com/3/</link>
		<description><![CDATA[Description of third]]></description>
</item>
<item>
		<title>4</title>
		<link>http://someurl.com/4/</link>
		<description><![CDATA[Description of fourth]]></description>
</item>
<item>
		<title>5</title>
		<link>http://someurl.com/5/</link>
		<description><![CDATA[Description of fifth]]></description>
</item>
	</channel>
</rss>

Mar 05 '19 06:03 relikd

Here is the snippet I used:

[_xmlData enumerateByteRangesUsingBlock:^(const void * bytes, NSRange byteRange, BOOL * stop) {
    NSUInteger max = byteRange.location + byteRange.length;
    for (NSUInteger i = byteRange.location; i < max; i++) {
        unsigned char c = ((unsigned char*)bytes)[i];
        if (c < 0x20 && c != 0x9 && c != 0xA && c != 0xD) {
            ((unsigned char*)bytes)[i] = ' '; // replace lower ascii with blank
        }
    }
}];

This could be enabled with, e.g., a class variable or flag, and it can be postponed until the feed is about to be parsed. Let me know if this is a dumb idea, or if it has unforeseeable consequences.

Mar 06 '19 01:03 relikd

That might be the way to go, though I would do performance tests first to make sure it doesn’t have an impact.

It’s also possible that there’s some kind of way to tell libxml2 to ignore these. (I just haven’t looked yet.)

Mar 06 '19 01:03 brentsimmons

I don't think you want to do most of that range math; AIUI the bytes array always starts at a 0 index, not relative to the whole data (since it's a pointer to an arbitrary location in the data already). So just:

    for (NSUInteger i = 0; i < byteRange.length; i++) {

Other than that, seems fine to me!
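
For illustration, here is a minimal plain-C sketch of the same loop with zero-based indexing. The function name blank_control_bytes and the returned replacement count are made up for this example and are not part of RSParser. It assumes the data is UTF-8 or another ASCII-compatible encoding, where bytes below 0x20 never occur inside a multi-byte character; that would not hold for UTF-16 feeds.

```c
#include <stddef.h>

/* Hypothetical helper (not RSParser API): blank out control bytes that
   XML 1.0 does not allow, i.e. everything below 0x20 except tab (0x09),
   line feed (0x0A) and carriage return (0x0D).
   `chunk` already points at the first byte of the chunk, so the index
   runs from 0 to length - 1 regardless of where the chunk sits in the
   overall data. Returns the number of bytes that were replaced. */
size_t blank_control_bytes(unsigned char *chunk, size_t length) {
    size_t replaced = 0;
    for (size_t i = 0; i < length; i++) {
        unsigned char c = chunk[i];
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            chunk[i] = ' '; /* same length, so offsets stay valid */
            replaced++;
        }
    }
    return replaced;
}
```

In the block version, this would be called once per chunk with bytes and byteRange.length, ignoring byteRange.location.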

Nov 27 '19 20:11 Wevah

Oh, you might also be able to specify const unsigned char *bytes as the type in the block declaration to avoid all the casting (if the compiler doesn't complain).

Not sure if the expected constness of the pointed-to bytes will cause issues when mutating, though it might be fine since you're not changing the length. (Copying to a new mutable data should be safer if that's a concern.)
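
A rough sketch of the copying variant in plain C, in case mutating the original bytes is a concern. The function name copy_blanking_control_bytes is hypothetical; in Objective-C the rough equivalent would be to take a mutable copy of the data and run the loop over its mutableBytes. The source buffer stays untouched and the caller owns the returned buffer.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical helper (not RSParser API): return a newly allocated copy
   of `src` in which control bytes disallowed by XML 1.0 are replaced
   with spaces. The input is only read, so its constness is respected.
   Returns NULL if the allocation fails; the caller must free() the result. */
unsigned char *copy_blanking_control_bytes(const unsigned char *src, size_t length) {
    /* Allocate at least one byte so a zero-length input still yields
       a valid, freeable pointer. */
    unsigned char *dst = malloc(length > 0 ? length : 1);
    if (dst == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < length; i++) {
        unsigned char c = src[i];
        int forbidden = (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D);
        dst[i] = forbidden ? ' ' : c;
    }
    return dst;
}
```

The trade-off is one extra allocation and copy per feed, which is the kind of cost the performance tests mentioned earlier would need to measure.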

Nov 27 '19 21:11 Wevah
