JSONKit Handling Unicode Byte Order Marks

During deserialization, the current behaviour of JSONKit is to halt parsing with an "unexpected token" type error if a byte order mark is encountered at the beginning of the JSON data.

This is highly undesirable, as many text editors used to manually produce JSON documents (Windows Notepad being a prime example) do not provide an option to omit the BOM when saving Unicode-encoded text. Documents written with such editors therefore cannot be parsed by JSONKit. Requiring document authors to strip BOMs with other software, or to install a particular text editor, is not an ideal solution.

Put another way, BOMs are commonplace and being unable to parse documents containing them is a serious limitation of JSONKit.

I propose adding another JKParseOption flag (say, JKParseOptionConsumeBOMs or similar) that would enable a mode in which leading BOMs are skipped. This allows users for whom the current behaviour is desirable to retain that behaviour (by not specifying JKParseOptionConsumeBOMs), and users who expect BOMs to occur to be able to handle them transparently.
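As a rough sketch of how such an opt-in flag could gate the behaviour, the standalone C below returns how many leading bytes to skip. The constant `kConsumeBOMs`, its bit value, and the helper `leading_bom_length` are all hypothetical names for illustration and are not part of JSONKit; only the UTF-8 BOM is handled here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option bit standing in for the proposed
 * JKParseOptionConsumeBOMs; the value is an assumption. */
enum { kConsumeBOMs = 1 << 8 };

/* Returns the number of leading bytes to skip before parsing.
 * With the flag clear, nothing is skipped, so the existing
 * behaviour (an error on a leading BOM) is preserved. */
static size_t leading_bom_length(unsigned int options,
                                 const unsigned char *bytes,
                                 size_t length) {
  static const unsigned char utf8_bom[3] = {0xef, 0xbb, 0xbf};

  if ((options & kConsumeBOMs) == 0) {
    return 0; /* caller did not opt in */
  }
  if (length >= sizeof(utf8_bom) &&
      memcmp(bytes, utf8_bom, sizeof(utf8_bom)) == 0) {
    return sizeof(utf8_bom);
  }
  return 0; /* no BOM present */
}
```

A caller would advance its buffer pointer by the returned count and shrink its length by the same amount before handing the data to the parser.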

For reference, I have achieved the desired BOM-compatible behaviour by adding the following code after the NSCParameterAssert in the _JKParseUTF8String method (only tested with UTF-8 BOMs):

// Known byte order marks. Order matters: the UTF-32 (LE) mark starts with the
// same two bytes as the UTF-16 (LE) mark, so the longer one is tested first.
struct {
  size_t length;
  const unsigned char mark[4];
}
boms[] = {
  {4, {0x00, 0x00, 0xfe, 0xff}}, // UTF-32 (BE)
  {4, {0xff, 0xfe, 0x00, 0x00}}, // UTF-32 (LE)
  {3, {0xef, 0xbb, 0xbf, 0x00}}, // UTF-8
  {2, {0xfe, 0xff, 0x00, 0x00}}, // UTF-16 (BE)
  {2, {0xff, 0xfe, 0x00, 0x00}}, // UTF-16 (LE)
};
// Skip the first matching mark, if any, by advancing past it.
for (size_t i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
  if (length >= boms[i].length && memcmp(string, boms[i].mark, boms[i].length) == 0) {
    string += boms[i].length;
    length -= boms[i].length;
    break;
  }
}

Feb 06 '12 05:02 mcoombe

JSONKit only parses UTF-8, so the other encodings of the BOM definitely shouldn't be simply skipped.
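One way to act on this, sketched below in standalone C, is to classify the leading bytes so that only a UTF-8 BOM is skipped while UTF-16 and UTF-32 marks are reported as an unsupported encoding. The `bom_kind` type and the `classify_bom` function are hypothetical names for illustration and do not exist in JSONKit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type, not part of JSONKit. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UNSUPPORTED } bom_kind;

/* Classifies the start of a buffer. A UTF-8 BOM can be skipped safely;
 * a UTF-16 or UTF-32 BOM means the payload is not UTF-8, so skipping
 * the mark would only postpone the failure. */
static bom_kind classify_bom(const unsigned char *bytes, size_t length) {
  static const unsigned char utf8[3]    = {0xef, 0xbb, 0xbf};
  static const unsigned char utf32be[4] = {0x00, 0x00, 0xfe, 0xff};
  static const unsigned char utf16be[2] = {0xfe, 0xff};
  static const unsigned char utf16le[2] = {0xff, 0xfe}; /* also prefixes UTF-32 (LE) */

  if (length >= 3 && memcmp(bytes, utf8, 3) == 0) {
    return BOM_UTF8;
  }
  if (length >= 4 && memcmp(bytes, utf32be, 4) == 0) {
    return BOM_UNSUPPORTED;
  }
  if (length >= 2 && (memcmp(bytes, utf16be, 2) == 0 ||
                      memcmp(bytes, utf16le, 2) == 0)) {
    return BOM_UNSUPPORTED;
  }
  return BOM_NONE;
}
```

With this split, a caller could advance past three bytes on `BOM_UTF8` and raise a more specific encoding error on `BOM_UNSUPPORTED` instead of the generic "unexpected token" failure.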

Nov 09 '12 03:11 ksperling

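@mcoombe, in my case the BOM was at the end of the string. I added the following block, which reuses the boms table above:

// Look for a trailing mark; the end pointer is computed only after the
// length check so it can never point before the start of the buffer.
for (int i = 0; i < sizeof(boms) / sizeof(boms[0]); ++i) {
  if (length >= boms[i].length) {
    char *endofstr = string + (length - boms[i].length);
    if (memcmp(endofstr, boms[i].mark, boms[i].length) == 0) {
      string[length - boms[i].length] = '\0';
      length -= boms[i].length;
      break;
    }
  }
}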

Jun 11 '13 15:06 toptierlabs
