node-rss icon indicating copy to clipboard operation
node-rss copied to clipboard

Strings are not conforming to the RSS spec for valid chars.

Open ErisDS opened this issue 9 years ago • 7 comments

The RSS spec specifies exactly which characters are considered valid in RSS:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

At present, this library doesn't handle ensuring that the strings it outputs conform to the spec. This means that the RSS feeds that are generated can easily become broken. We're having this problem in Ghost, when users copy & paste data from elsewhere - things like form feed and other control characters are completely invisible, but cause the RSS feed to become invalid & unusable.

There is some interesting information around about fixing this sort of problem:

http://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters http://stackoverflow.com/questions/2670037/how-to-remove-invalid-utf-8-characters-from-a-javascript-string

And here's an example regex that I have been trying out for fixing the issue:

/(?![\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD])./g

Here it is in action:

https://regex101.com/r/pQ7aB6/1

I have a branch with this implemented in Ghost, and it seems to work ok: https://github.com/ErisDS/Ghost/commit/7acb3f9df3e7f2cec54eae8173de6a3947bfaaf8

This seems to work well, the only question is whether the regex is a bit too naive / slow / memory intensive for use in a library like node-rss?

I'd be happy to PR a fix to node-rss, but interested to get some feedback on the regex and whether a different approach might be better.

ErisDS avatar May 21 '15 12:05 ErisDS

Would be great to get some feedback on this, and see if we could move it forward.

ErisDS avatar Oct 15 '15 15:10 ErisDS

We just had this issue and would appreciate something in place to stop it happening again - not really an easy bug to track down!

BenDTU avatar Oct 24 '16 04:10 BenDTU

Please see my comment / issue raised over on the Ghost Blog. https://github.com/TryGhost/Ghost/issues/8442

@ErisDS @BenDTU @dylang ?

alexellis avatar May 09 '17 17:05 alexellis

@alexellis as you can see - no-one has ever tested & verified my fix, therefore I haven't pushed it. I am the maintainer of this repo now, so if you'd like to help out and verify that the fix here is sane I can get it merged 👍

ErisDS avatar May 09 '17 17:05 ErisDS

@ErisDS I'd suggest writing a unit test for it 👍 - are there any in this project already?

alexellis avatar May 09 '17 18:05 alexellis

That's all well and good - but a unit test is a self-fulfilling prophecy, it only makes sense when you're certain the concept is correct, which I am not ;)

ErisDS avatar May 09 '17 18:05 ErisDS

image

I've done some benchmark with jsbenchit (https://jsbenchit.org/?src=3757b1fdb0aae6d3d72642d9468d6a66), and IMHO looks ok to use regex.

I actually ran the test thinking that using a regex would be slower than using a for loop. But as it turns out, regex is better.

It's a browser-based benchmark, so other JS runtimes might have different results. In any case, I don't think it has a significant impact on performance.

p.s. I'm user of ghost.io and having a hard time from non-printable chars 🥲

sh-cho avatar Jun 27 '23 13:06 sh-cho