JSONKit
Handling Unicode Byte Order Marks
During deserialization, the current behaviour of JSONKit is to halt parsing with an "unexpected token" type error if a byte order mark is encountered at the beginning of the JSON data.
This is highly undesirable, as many text editors used to manually produce JSON documents (Windows Notepad being a prime example) do not provide an option to omit the BOM when saving Unicode-encoded text. Documents written using such editors therefore cannot be parsed using JSONKit. Requiring document authors to strip BOMs with other software, or to install a particular text editor, is not an ideal solution.
Put another way, BOMs are commonplace and being unable to parse documents containing them is a serious limitation of JSONKit.
I propose adding another JKParseOption flag (say, JKParseOptionConsumeBOMs or similar) that would enable a mode in which leading BOMs are skipped. Users for whom the current behaviour is desirable would retain it by not specifying JKParseOptionConsumeBOMs, and users who expect BOMs to occur would be able to handle them transparently.
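A minimal sketch of how such an opt-in flag could gate BOM handling, written as plain C. The names SketchParseOptionConsumeBOMs and sketch_leading_bom_length are hypothetical and are not part of JSONKit, and only the UTF-8 mark (EF BB BF) is considered here:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for illustration; JSONKit's real JKParseOption
   constants are defined elsewhere and may use different bits. */
enum {
    SketchParseOptionNone        = 0,
    SketchParseOptionConsumeBOMs = 1 << 0
};

/* Returns how many leading bytes to skip: 3 when the option is set and the
   buffer starts with the UTF-8 BOM, otherwise 0. */
size_t sketch_leading_bom_length(const unsigned char *bytes, size_t length,
                                 unsigned options)
{
    static const unsigned char utf8_bom[3] = {0xEF, 0xBB, 0xBF};

    assert(bytes != NULL || length == 0);

    /* Flag not set: keep the current strict behaviour. */
    if ((options & SketchParseOptionConsumeBOMs) == 0) {
        return 0;
    }
    if (length >= sizeof(utf8_bom) &&
        memcmp(bytes, utf8_bom, sizeof(utf8_bom)) == 0) {
        return sizeof(utf8_bom);
    }
    return 0;
}
```

A parser could add the returned count to its start offset before tokenizing. With the flag unset the result is always 0, so existing callers would see no change.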
For reference, I have achieved the desired BOM-compatible behaviour by adding the following code after the NSCParameterAssert in the _JKParseUTF8String method (only tested with UTF-8 BOMs):
    /* Known byte order marks. The 4-byte marks come first because the
       UTF-32 (LE) mark begins with the same bytes as the UTF-16 (LE) mark. */
    struct {
        size_t length;
        const unsigned char mark[4];
    }
    boms[] = {
        {4, {0x00, 0x00, 0xfe, 0xff}}, // UTF-32 (BE)
        {4, {0xff, 0xfe, 0x00, 0x00}}, // UTF-32 (LE)
        {3, {0xef, 0xbb, 0xbf, 0x00}}, // UTF-8
        {2, {0xfe, 0xff, 0x00, 0x00}}, // UTF-16 (BE)
        {2, {0xff, 0xfe, 0x00, 0x00}}, // UTF-16 (LE)
    };
    /* Skip the first mark that matches the start of the input. */
    for (size_t i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
        if (length >= boms[i].length && memcmp(string, boms[i].mark, boms[i].length) == 0) {
            string += boms[i].length;
            length -= boms[i].length;
            break;
        }
    }
JSONKit only parses UTF-8, so the BOMs of the other encodings definitely shouldn't simply be skipped: the data following them would still be UTF-16 or UTF-32, which the parser cannot read.
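One way to act on this, sketched in plain C under the assumption that the caller wants to skip a UTF-8 BOM but reject UTF-16 and UTF-32 input up front. The names sketch_bom_kind, sketch_classify_bom and sketch_prepare_utf8 are hypothetical and not part of JSONKit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for illustration. */
typedef enum {
    SKETCH_BOM_NONE = 0,
    SKETCH_BOM_UTF8,
    SKETCH_BOM_UTF16_BE,
    SKETCH_BOM_UTF16_LE,
    SKETCH_BOM_UTF32_BE,
    SKETCH_BOM_UTF32_LE
} sketch_bom_kind;

/* Reports which BOM, if any, the buffer starts with. */
sketch_bom_kind sketch_classify_bom(const unsigned char *bytes, size_t length)
{
    /* 4-byte marks first: FF FE 00 00 would otherwise match UTF-16 (LE). */
    static const struct {
        size_t length;
        unsigned char mark[4];
        sketch_bom_kind kind;
    } boms[] = {
        {4, {0x00, 0x00, 0xFE, 0xFF}, SKETCH_BOM_UTF32_BE},
        {4, {0xFF, 0xFE, 0x00, 0x00}, SKETCH_BOM_UTF32_LE},
        {3, {0xEF, 0xBB, 0xBF, 0x00}, SKETCH_BOM_UTF8},
        {2, {0xFE, 0xFF, 0x00, 0x00}, SKETCH_BOM_UTF16_BE},
        {2, {0xFF, 0xFE, 0x00, 0x00}, SKETCH_BOM_UTF16_LE}
    };
    size_t i;

    assert(bytes != NULL || length == 0);

    for (i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
        if (length >= boms[i].length &&
            memcmp(bytes, boms[i].mark, boms[i].length) == 0) {
            return boms[i].kind;
        }
    }
    return SKETCH_BOM_NONE;
}

/* Advances past a UTF-8 BOM. Returns 0 if the buffer can be handed to a
   UTF-8 parser, or -1 (leaving the arguments untouched) if it starts with
   a UTF-16 or UTF-32 BOM. */
int sketch_prepare_utf8(const unsigned char **bytes, size_t *length)
{
    switch (sketch_classify_bom(*bytes, *length)) {
    case SKETCH_BOM_NONE:
        return 0;
    case SKETCH_BOM_UTF8:
        *bytes  += 3;
        *length -= 3;
        return 0;
    default:
        return -1;
    }
}
```

Rejecting the other marks lets the caller report an "unsupported encoding" error instead of the less helpful "unexpected token" failure further into the data.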
@mcoombe, in my case the BOM was at the end of the string. I added the following block:
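    /* Uses the same boms[] table as above. */
    for (size_t i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
        /* Check the length before computing the tail pointer, so it never
           points before the start of the buffer. */
        if (length >= boms[i].length) {
            const unsigned char *endofstr = string + (length - boms[i].length);
            if (memcmp(endofstr, boms[i].mark, boms[i].length) == 0) {
                /* Terminate the string where the trailing BOM begins.
                   This requires the buffer to be writable. */
                string[length - boms[i].length] = '\0';
                length -= boms[i].length;
                break;
            }
        }
    }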
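If the buffer is read-only, a trailing BOM can also be dropped by shortening the length alone. This is a minimal sketch that assumes the parser is given an explicit length rather than relying on NUL termination, and that only the UTF-8 mark matters; sketch_length_without_trailing_bom is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the length to parse, excluding a trailing UTF-8 BOM if present.
   The buffer itself is never modified. */
size_t sketch_length_without_trailing_bom(const unsigned char *bytes,
                                          size_t length)
{
    static const unsigned char utf8_bom[3] = {0xEF, 0xBB, 0xBF};

    assert(bytes != NULL || length == 0);

    if (length >= sizeof(utf8_bom) &&
        memcmp(bytes + (length - sizeof(utf8_bom)), utf8_bom,
               sizeof(utf8_bom)) == 0) {
        return length - sizeof(utf8_bom);
    }
    return length;
}
```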