utfgrid-spec

trouble with demo.json validation

Open tschaub opened this issue 13 years ago • 4 comments

I'm trying to write some tests for a browser implementation that use the demo.json described in the spec. I'm seeing trouble once I hit row 215, col 222 - the 55262nd id. If I understand correctly, this should be "encoded" as code point 55296 (0xD800). I notice that some parsers mention 55296 to 57343 (0xD800 to 0xDFFF) as the UTF-16 surrogate range, which cannot be converted to UTF-8.
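For reference, the arithmetic behind that number can be sketched as follows. This assumes the id encoding as described in the UTFGrid spec (add 32, then skip the `"` and `\` code points); `encodeId` and `decodeId` are hypothetical names used only for illustration.

```javascript
// Sketch of the UTFGrid id <-> code point mapping, assuming the rules
// from the spec: add 32, then step over 34 (") and 92 (\).
function encodeId(id) {
    var code = id + 32;
    if (code >= 34) { code += 1; }  // skip the double quote
    if (code >= 92) { code += 1; }  // skip the backslash
    return code;
}

// Inverse mapping, applied in the opposite order.
function decodeId(code) {
    if (code >= 93) { code -= 1; }
    if (code >= 35) { code -= 1; }
    return code - 32;
}

// Id 55262 lands exactly on the first UTF-16 surrogate, 0xD800.
console.log(encodeId(55262), encodeId(55262) === 0xD800);
```

Every id from 55262 through 57309 maps into the surrogate range 55296 to 57343 under this scheme.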

I'm serving up my tests (with <meta http-equiv="content-type" content="text/html; charset=UTF-8">) and demo.json with Apache to Chrome 17 (same behavior on Firefox 10). Thanks for any hints on what might be up. I'm not entirely confident this is UTF-8 through and through.

tschaub, Feb 28 '12 00:02

I've put together a basic Jasmine test spec to demonstrate the issue I'm seeing. Note that this is a fork of the mapbox/mbtiles-spec repo with the demo.json referenced in the latest UTFGrid spec.

I couldn't find any other UTFGrid-related tests for the client. Let me know if I've missed some - seeing working tests would help figure out what might be going wrong on my side.

Thanks.

tschaub, Mar 01 '12 01:03

@tschaub - thanks for this report. Nothing immediately comes to mind about why this is failing. It's certainly possible that it is a problem with the demo.json. I should have some time next week to dig into this a bit more.

/cc @kkaefer - any thoughts?

springmeyer, Mar 01 '12 02:03

Surrogates are invalid UTF-8

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
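A quick way to see this in practice is sketched below. It assumes an environment with the standard `TextEncoder` and `TextDecoder` globals (modern browsers, recent Node.js); the variable names are illustrative only.

```javascript
// A lone surrogate cannot be written out as UTF-8: TextEncoder
// substitutes U+FFFD (bytes EF BF BD) for it.
var encoded = Array.from(new TextEncoder().encode('\uD800'));
console.log(encoded); // [239, 191, 189]

// Going the other way, the byte sequence a naive encoder would emit
// for U+D800 (ED A0 80) is rejected by a conforming UTF-8 decoder,
// which yields replacement characters instead of the surrogate.
var decoded = new TextDecoder('utf-8').decode(
    new Uint8Array([0xED, 0xA0, 0x80])
);
console.log(decoded.indexOf('\uD800') === -1); // the surrogate is gone
```

Either way the original code point is lost, which would explain a grid lookup failing at the first id that maps to 0xD800.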

One way to deal with it would be to treat the strings as UTF-16 and decode them into an array of Numbers. We would then be able to use the entire Unicode range of 0 to 0x10FFFF (minus characters that are invalid in JSON strings).

saik0, Jun 05 '13 11:06

Something like this, but with saner error handling:

function utf16ToUnicode (str) {
    var utf32 = 0,
        isPair = false,
        out = [],
        len = str.length;

    for(var i = 0, code; i < len; i++) {
        code = str.charCodeAt(i);
        if (!isPair) {
            if ((code  & 0xFC00) == 0xD800) {
                // High surrogate of new pair sequence
                utf32 = ((code & 0x3ff) << 10) + 0x10000;
                isPair = true;
            } else if ((code & 0xFC00) == 0xDC00) {
                // Unexpected Low Surrogate
                return false;
            } else {
                // BMP code point, pass straight through
                out.push(code);
            }
        } else {
            // When isPair is true, we expect a continuation of a surrogate pair
            if ((code & 0xFC00) == 0xDC00) {
                // Legal low surrogate
                utf32 |= (code & 0x3FF);
                out.push(utf32);
            } else {
                // Incomplete surrogate pair
                return false;
            }
            utf32 = 0;
            isPair = false;
        }
    }
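    // A high surrogate at the very end of the string is an incomplete pair
    if (isPair) {
        return false;
    }
    return out;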
}

Edit: Fixed decoding bug
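In environments that support ES2015, roughly the same decoding can be sketched with the built-in `String.prototype.codePointAt`. This is an alternative illustration, not the code above; `toCodePoints` is a hypothetical name, and returning `false` on unpaired surrogates simply mirrors the error convention used in the function above.

```javascript
// Decode a UTF-16 string into an array of Unicode code points.
// Returns false if the string contains an unpaired surrogate.
function toCodePoints(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            // codePointAt returns the bare code unit when a surrogate
            // is not part of a valid high/low pair.
            return false;
        }
        out.push(cp);
        if (cp > 0xFFFF) {
            i++; // a pair occupied two code units; skip the low half
        }
    }
    return out;
}

console.log(toCodePoints('a\uD83D\uDE00')); // [97, 128512]
console.log(toCodePoints('\uD800'));        // false
```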

saik0, Jun 05 '13 11:06