
Worth noting: No newlines in JSON strings

Open snej opened this issue 7 years ago • 1 comments

If the history of RSS is any guide, people are going to be writing code that generates JSON feeds by ad-hoc string concatenation or template substitution, without going through a real JSON encoder. And they're going to make mistakes that result in invalid JSON, most likely when writing article bodies.

JSON parsers will generally barf on these, which should mean that most of these mistakes get caught in casual testing before being released into the wild, but the different parsers vary in strictness, so it's possible someone will test with a more lenient parser and then their feed(s) will fail for others. Or the mistakes might only occur in some cases that aren't hit during testing.
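To make the failure mode concrete, here is a minimal Python sketch. The `item_by_template` helper, the field values and the sample bodies are all made up for illustration. The point is only that a body containing a double quote breaks template substitution, while a real encoder (`json.dumps` here) handles it.

```python
import json

def item_by_template(item_id, body):
    # Ad-hoc template substitution: nothing in the body gets escaped.
    return '{"id": "%s", "content_text": "%s"}' % (item_id, body)

def item_by_encoder(item_id, body):
    # A real JSON encoder escapes quotes, backslashes and control characters.
    return json.dumps({"id": item_id, "content_text": body})

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

plain = "Hello world"
quoted = 'She said "hi" and left'

# A simple body passes casual testing even with the template approach.
print(is_valid_json(item_by_template("1", plain)))
# The embedded quote ends the JSON string early, so the document is invalid.
print(is_valid_json(item_by_template("2", quoted)))
# The encoder escapes the quote and the document stays valid.
print(is_valid_json(item_by_encoder("2", quoted)))
```

The template version works for the first body, which is exactly why this kind of bug can survive a quick test and only show up once a real article hits it.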

There are two things I think are worth calling out in the spec:

JSON strings cannot contain literal newlines or tabs; they must be escaped as \n or \t. (The RFC requires that all control characters, U+0000 through U+001F, be escaped.) Some parsers seem not to mind if this is violated, but some do.
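Python's standard `json` module happens to show both behaviors, which makes for a quick sketch of the problem. By default `json.loads` rejects a raw control character inside a string; with `strict=False` it accepts it, much like the more forgiving parsers. The sample document is invented.

```python
import json

# A hand-built document with a raw newline and a raw tab inside the string value.
raw = '{"content_text": "line one\nline two\tend"}'

# Default (strict) parsing rejects the unescaped control characters.
try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# A lenient parse accepts the same text, so testing against it hides the bug.
lenient = json.loads(raw, strict=False)

# With the control characters escaped, the strict parser accepts the document too.
escaped = '{"content_text": "line one\\nline two\\tend"}'
assert json.loads(escaped) == lenient

print(strict_ok)
```

A feed tested only against the lenient mode would look fine and then fail for any reader that uses a strict parser.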

JSON has some very specific rules for how to escape Unicode characters. If someone uses something other than a real JSON encoder to do the escaping, the results may work most of the time but not always; for example, Latin characters might make it through OK but non-Latin ones might not. Again, this might slip past the kind of rudimentary testing that a lot of web devs do. (I'm talking about you, PHP kiddies.) For example, I've found that JSON-encoding NSStrings is tricky because NSString's "characters" are not Unicode code points but rather UTF-16 code units, and if you don't wrap your head around that, lots of characters above U+FFFF come out wrong. (Actually, the popularity of emoji is a real boon here, as emoji represent the most complex case of Unicode character encoding; so if you don't get the escaping correct, emoji tend to break, which is quickly apparent in real-world use.)
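The pitfall can be sketched in Python, where string elements are code points rather than UTF-16 code units, so the mistake runs in the opposite direction from the NSString case but breaks the same characters. Both helpers below are hypothetical hand-rolled escapers written for this illustration, and they deliberately handle only non-ASCII characters. JSON's `\u` escape takes exactly four hex digits, so a character above U+FFFF has to be written as a UTF-16 surrogate pair.

```python
import json

def naive_escape(s):
    # Hypothetical hand-rolled escaper: one \u escape per code point.
    # Wrong above U+FFFF, where the hex value needs more than four digits.
    return "".join(c if ord(c) < 0x80 else "\\u%04x" % ord(c) for c in s)

def surrogate_escape(s):
    # Escape per UTF-16 code unit, so characters above U+FFFF
    # come out as a surrogate pair of two \u escapes.
    out = []
    for c in s:
        if ord(c) < 0x80:
            out.append(c)
        else:
            units = c.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)

accented = "\u00e9"    # e with acute accent, inside the Basic Multilingual Plane
emoji = "\U0001F600"   # grinning face emoji, outside it

for text in (accented, emoji):
    for escape in (naive_escape, surrogate_escape):
        round_trip = json.loads('"' + escape(text) + '"')
        print(escape.__name__, ascii(text), round_trip == text)
```

The naive version round-trips the accented letter but not the emoji: its five-digit output is read as a four-digit escape followed by a stray digit. That is the "works for Latin text, breaks on emoji" pattern described above.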

The best advice for escaping Unicode is probably "don't do it." The spec clearly says that only double-quote, backslash and control characters need to be escaped. Everything else can appear literally in a string.
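In Python, for instance, that advice amounts to passing `ensure_ascii=False` to `json.dumps` and sending the result as UTF-8. A small sketch with an invented sample string:

```python
import json

# Sample text: an accented letter, an emoji, double quotes, a backslash, a newline.
text = 'Caf\u00e9 \U0001F600 says "hi"\\\n'

# Default: every non-ASCII character becomes a \u escape.
ascii_only = json.dumps(text)

# ensure_ascii=False: non-ASCII characters stay literal; only the double
# quote, the backslash and control characters are escaped.
literal = json.dumps(text, ensure_ascii=False)

# Both forms decode to the same string.
assert json.loads(ascii_only) == json.loads(literal) == text

# The literal form contains non-ASCII characters, so it must be sent as UTF-8.
payload = literal.encode("utf-8")
print(ascii_only)
print(payload)
```

Leaving the escaping of non-ASCII characters out entirely means there is no surrogate-pair logic to get wrong.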

snej · May 17 '17 20:05