trouble with demo.json validation
I'm trying to write some tests for a browser implementation that use the demo.json described in the spec. I'm seeing trouble once I hit row 215, col 222 - the 55262nd id. If I understand right, this should be "encoded" as 55296. I notice that some parsers mention 55296 to 57343 as a range where UTF-16 surrogate pairs cannot be converted to UTF-8.
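Here is a minimal sketch of the id-to-code-point mapping as I understand it from the UTFGrid spec (add 32, then skip the two code points that need escaping in JSON). The names `encodeId` and `decodeId` are made up for illustration and don't come from any library.

```javascript
// Sketch of the UTFGrid id <-> code point mapping, assuming the spec's rules:
// add 32, then skip 34 (") and 92 (\), which would need escaping in JSON.
// encodeId and decodeId are hypothetical names used only for this example.
function encodeId(id) {
    var code = id + 32;
    if (code >= 34) code++;
    if (code >= 92) code++;
    return code;
}

function decodeId(code) {
    if (code >= 93) code--;
    if (code >= 35) code--;
    return code - 32;
}

console.log(encodeId(55262));           // 55296, i.e. 0xD800
console.log(decodeId(encodeId(55262))); // 55262
```

If that reading is right, id 55262 lands exactly on 0xD800, the first code unit of the surrogate range.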
I'm serving up my tests (with <meta http-equiv="content-type" content="text/html; charset=UTF-8">) and demo.json with Apache to Chrome 17 (same behavior on Firefox 10). Thanks for any hints on what might be up. I'm not entirely confident this is UTF-8 through and through.
I've put together a basic Jasmine test spec to demonstrate the issue I'm seeing. Note that this is a fork of the mapbox/mbtiles-spec repo with the demo.json referenced in the latest UTFGrid spec.
I couldn't find any other UTFGrid related tests for the client. Let me know if I've missed some - seeing working tests would help figure out what might be going wrong on my side.
Thanks.
@tschaub - thanks for this report. Nothing immediately comes to mind about why this is failing. It's certainly possible it is a problem with the demo.json. I should have some time next week to dig into this a bit more.
/cc @kkaefer - any thoughts?
The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
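A quick way to see this from JavaScript, assuming an environment that provides both `encodeURIComponent` and `TextEncoder` (current browsers and Node do; older browsers may lack `TextEncoder`):

```javascript
// A lone surrogate has no UTF-8 encoding, so encodeURIComponent throws a URIError.
var lone = String.fromCharCode(0xD800);
var threw = false;
try {
    encodeURIComponent(lone);
} catch (e) {
    threw = e instanceof URIError;
}
console.log(threw); // true

// TextEncoder does not throw; it substitutes U+FFFD (bytes EF BF BD).
var bytes = Array.from(new TextEncoder().encode(lone));
console.log(bytes); // [239, 191, 189]
```

Either way, the original code unit does not survive a round trip through UTF-8.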
One way to deal with it would be to treat the strings as UTF-16 and decode them into an array of Numbers. We would then be able to use the entire Unicode range of 0 - 0x10FFFF (minus the surrogates and whatever is invalid in JSON).
Something like this, with saner error handling:
    function utf16ToUnicode(str) {
        var utf32 = 0,
            isPair = false,
            out = [],
            len = str.length;
        for (var i = 0, code; i < len; i++) {
            code = str.charCodeAt(i);
            if (!isPair) {
                if ((code & 0xFC00) === 0xD800) {
                    // High surrogate of new pair sequence
                    utf32 = ((code & 0x3FF) << 10) + 0x10000;
                    isPair = true;
                } else if ((code & 0xFC00) === 0xDC00) {
                    // Unexpected low surrogate
                    return false;
                } else {
                    // BMP code point, pass straight through
                    out.push(code);
                }
            } else {
                // When isPair is true, we expect a continuation of a surrogate pair
                if ((code & 0xFC00) === 0xDC00) {
                    // Legal low surrogate
                    utf32 |= (code & 0x3FF);
                    out.push(utf32);
                } else {
                    // Incomplete surrogate pair
                    return false;
                }
                utf32 = 0;
                isPair = false;
            }
        }
        // A high surrogate at the very end of the string is also an incomplete pair
        if (isPair) {
            return false;
        }
        return out;
    }
Edit: Fixed decoding bug
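For comparison, in engines that support ES2015 roughly the same decoding can be sketched with `String.prototype.codePointAt`. This is only an illustration, not a drop-in replacement: `toCodePoints` is a hypothetical name, and it assumes the same "return false on any unpaired surrogate" convention as `utf16ToUnicode`.

```javascript
// Hypothetical helper: decode a JS (UTF-16) string into an array of code points.
// Returns false as soon as it meets an unpaired surrogate.
function toCodePoints(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            // codePointAt returns the surrogate itself when it is not part of a valid pair
            return false;
        }
        out.push(cp);
        if (cp > 0xFFFF) {
            i++; // skip the low surrogate that was consumed as part of the pair
        }
    }
    return out;
}

console.log(toCodePoints("A\uD83D\uDE00")); // [65, 128512]
console.log(toCodePoints("\uD800"));        // false
```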