gb18030 encoding/decoding support

Open r12a opened this issue 9 years ago • 22 comments

Results for a series of tests for gb18030 encoding/decoding can be found at https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gb18030

The tests can be run from that page (select the link in the left-most column), or you can get them from the WPT repo. There is a PR at https://github.com/w3c/web-platform-tests/pull/3195

The tests check whether:

  1. the browser produces the expected byte sequences for all characters in the gb18030 index after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
  2. the browser produces the expected byte sequences for miscellaneous characters not in the gb18030 index when encoding bytes for a URL produced by a form, using the encoder steps in the specification. (tests for several ranges)
  3. same two types of test when writing characters to an href value
  4. the browser decodes all characters as expected from a file generated by encoding all pointers in the gb18030 index per the encoder steps in the specification.
  5. the browser decodes all characters as expected from a file generated by encoding miscellaneous characters not in the gb18030 index per the encoder steps in the specification. (tests for several ranges)
  6. when decoding gb18030 text, the browser uses replacement characters as described by the algorithm in the Encoding spec.
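To make items 1 and 3 concrete, here is a minimal sketch of what "expected byte sequences" means for a form-submitted URL. It assumes Python's built-in `gb18030` codec as a stand-in; that codec is not an implementation of the Encoding Standard and its tables may differ in places, so this only illustrates the shape of the check. The helper name `form_url_bytes` is hypothetical.

```python
from urllib.parse import quote


def form_url_bytes(text: str) -> str:
    """Percent-encode text the way a gb18030 form submission would.

    Assumption: Python's gb18030 codec stands in for the spec's encoder.
    """
    return quote(text, encoding="gb18030")


# A two-byte character from the index, and the first four-byte sequence.
two_byte = "\u4e2d".encode("gb18030")   # U+4E2D
four_byte = "\u0080".encode("gb18030")  # U+0080

print(two_byte.hex(" "), four_byte.hex(" "))
print(form_url_bytes("\u4e2d"))
```

The tests do the equivalent for every pointer in the index and compare the browser's output against the bytes produced by the encoder steps in the specification.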

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

[Image: results table, screen shot 2016-06-23 at 19 41 14]

Notes:

  • all href tests fail for Edge because characters are not converted to percent-escapes

Can we please investigate the failures to ascertain whether:

  1. the browser needs to be changed
  2. the spec needs to be changed
  3. the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

r12a avatar Jun 20 '16 14:06 r12a

I updated the table. Something went wrong with the scoring for Safari in the previous version. Should now reflect reality.

r12a avatar Jun 23 '16 18:06 r12a

List of bugs raised:

  • https://bugs.webkit.org/show_bug.cgi?id=159891
  • https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/8202252/

r12a avatar Sep 15 '16 17:09 r12a

Note to self: Chrome shows the 3 failures for decode errors, but when the API (TextDecoder) is used it produces the correct result. I believe this is because Chrome doesn't "flush" at the end of resource (i.e. non-API) streams... might be the same in other browsers.
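A rough illustration of what a missing "flush" does, assuming Python's incremental `gb18030` decoder as a stand-in for a browser's streaming decoder (it is not Chrome's code, and the helper name `decode_stream` is hypothetical). A trailing incomplete sequence only turns into a replacement character when the decoder is told the stream has ended.

```python
import codecs


def decode_stream(chunks, flush: bool) -> str:
    """Decode chunks incrementally; optionally signal end-of-stream."""
    decoder = codecs.getincrementaldecoder("gb18030")(errors="replace")
    out = "".join(decoder.decode(chunk) for chunk in chunks)
    if flush:
        # final=True tells the decoder no more bytes are coming, so any
        # pending partial sequence is reported as an error.
        out += decoder.decode(b"", final=True)
    return out


# A complete two-byte character followed by a lone lead byte.
chunks = [b"\xd6\xd0", b"\xd6"]
print(repr(decode_stream(chunks, flush=False)))
print(repr(decode_stream(chunks, flush=True)))
```

Without the flush, the dangling lead byte is silently held in the decoder's buffer and never produces output, which would look like a missing U+FFFD in a test that expects one.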

inexorabletash avatar Sep 16 '16 00:09 inexorabletash

FYI, Chromium may soon change the decoding table to map 28 byte sequences that used to be mapped to PUA code points (completely useless and even harmful on platforms where there's no font to cover those PUA code points) to regular Unicode characters (see #22, #27 and http://crbug.com/645783 ).

jungshik avatar Sep 16 '16 06:09 jungshik

wrt Safari support for GB 18030 (and probably other encodings), there's a discussion at https://bugs.webkit.org/show_bug.cgi?id=159891 that people following the Encoding issues may be able to help with.

r12a avatar Nov 18 '16 17:11 r12a

I'm loath to jump in on that bug, but ISTM the answer for WebKit should be "normalize on input". Once it's in the DOM, normalization should not happen, as @r12a points out.

inexorabletash avatar Nov 18 '16 17:11 inexorabletash

I added a comment to that effect.

annevk avatar Nov 18 '16 18:11 annevk

The tests seem to disagree with the spec on the handling of ASCII bytes as part of a malformed sequence when decoding:

  • Fail step 2: 82 30 C3 assert_equals: expected "�" but got "�0�"
  • Fail step 5.7: 82 FF C3 33 assert_equals: expected "��" but got "��3"
  • Fail step 9: FF 30 C3 33 assert_equals: expected "�0�" but got "�0�3"

hsivonen avatar Apr 27 '17 15:04 hsivonen

@hsivonen just so you know, i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere. I've had even less time than normal lately because of various distractions.

r12a avatar Jun 15 '17 07:06 r12a

i'm still intending to check the above and change the tests (and results) where needed, when i get a moment. Same goes for similar comments elsewhere.

Great. Thank you.

hsivonen avatar Jun 15 '17 09:06 hsivonen

So, I've been helping rebase @r12a's pull requests, fix lint errors, and address some review comments from https://github.com/w3c/web-platform-tests/pull/3194 that apply to all of them. I am happy to continue doing that as I have it down to a pretty fast process. Which means if @r12a can just edit his remaining WPT PRs with the normative changes and then ping the appropriate thread, I am happy to carry things through to the finish line. Woohoo!

domenic avatar Jun 15 '17 22:06 domenic

@hsivonen wrt https://github.com/whatwg/encoding/issues/57#issuecomment-297747386, i have stepped through the conversion for the first two tests you mentioned several times, using the debugger alongside the spec text, and i still come up with the results expected by the test, rather than the results i get from Firefox (nightly). Are you able to point out for me why the test produces a different result from FF?

Here's a link to the test: https://www.w3.org/International/tests/repo/encoding/legacy-mb-schinese/gb18030/gb18030-decode-errors.html Thanks.

r12a avatar Aug 15 '17 17:08 r12a

I looked at the step 2 test. Below are the iterations and the results:

  1. gb18030 first = 0x82
  2. gb18030 first = 0x82, gb18030 second = 0x30
  3. gb18030 first = 0x82, gb18030 second = 0x30, gb18030 third = 0xC3
  4. Conditional in step 2 is true
  5. Conditional in step 1 is true

So @r12a is correct.

annevk avatar Aug 16 '17 07:08 annevk

@hsivonen ping wrt https://github.com/whatwg/encoding/issues/57#issuecomment-322540517

r12a avatar Oct 04 '17 12:10 r12a

Sorry about the delay.

Firefox, Chrome and Safari agree with each other on the 3 remaining failures. Edge is closer to the other browsers than to the spec.

So I think this is a spec bug. (And the tests reflect the spec.)

hsivonen avatar Nov 06 '17 10:11 hsivonen

Hmm. The Firefox situation might be confused somewhere between the Chrome/Safari behavior and the spec behavior. I need to investigate this more.

hsivonen avatar Nov 06 '17 11:11 hsivonen

OK. Here's what's happening:

Firefox implements the spec, but the test case doesn't test the spec. The test case expectations are written as if there was EOF after each examined sequence. However, the test input is not TextDecoder calls with just the sequences tested (each ending in EOF). Instead, it's an HTML file containing all the sequences within spans, so what comes after each test sequence isn't an EOF but a less-than sign (of the span end tag).

I will need to test what Chrome, Safari and Edge do when the sequences actually end in EOF, but my tentative opinion is that it's bad for the spec to collapse a bogus sequence of bytes to a different output when the bogus sequence is followed by EOF vs. when it's followed by something else that's not a valid continuation of the sequence.
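The EOF vs. less-than difference can be reproduced with a small model of the decoder's error paths. This is a sketch based on my reading of the gb18030 decoder steps in the Encoding Standard, not a full decoder: the function name `decode_malformed` is hypothetical, and every index lookup is assumed to miss, so it is only meaningful for inputs (like the three failing sequences) that never form a valid two- or four-byte character.

```python
REPLACEMENT = "\ufffd"


def decode_malformed(data: bytes) -> str:
    """Model only the error handling of the gb18030 decoder.

    Assumption: index lookups always return null (no real index here).
    """
    queue = list(data)
    out = []
    first = second = third = 0
    while True:
        if not queue:
            # End of queue: pending bytes collapse into a single error.
            if first or second or third:
                out.append(REPLACEMENT)
            break
        byte = queue.pop(0)
        if third:
            if not 0x30 <= byte <= 0x39:
                # Put back second, third and the current byte.
                queue[:0] = [second, third, byte]
            # A digit here would need the ranges index; the stub treats
            # that lookup as a miss and emits an error as well.
            first = second = third = 0
            out.append(REPLACEMENT)
        elif second:
            if 0x81 <= byte <= 0xFE:
                third = byte
            else:
                queue[:0] = [second, byte]
                first = second = 0
                out.append(REPLACEMENT)
        elif first:
            if 0x30 <= byte <= 0x39:
                second = byte
            else:
                first = 0
                if byte < 0x80:
                    queue.insert(0, byte)  # ASCII byte is re-examined
                out.append(REPLACEMENT)
        elif byte < 0x80:
            out.append(chr(byte))
        elif byte == 0x80:
            out.append("\u20ac")
        elif byte <= 0xFE:
            first = byte
        else:
            out.append(REPLACEMENT)
    return "".join(out)


for hex_bytes in ("8230C3", "82FFC333", "FF30C333"):
    raw = bytes.fromhex(hex_bytes)
    print(hex_bytes, repr(decode_malformed(raw)), repr(decode_malformed(raw + b"<")))
```

Under these assumptions, each sequence followed by EOF yields the output the test expects, while the same sequence followed by `<` yields the longer output reported in the failures.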

hsivonen avatar Nov 07 '17 07:11 hsivonen

I wrote demos that exercise both the followed-by-end-tag case and the followed-by-EOF case.

The spec, Firefox, Chrome and Safari agree on these, so I think it's best not to change the spec, even though it is rather unfortunate for the treatment of the bogus byte sequence to differ depending on what comes after.

In conclusion, this is a test case bug after all.

hsivonen avatar Nov 07 '17 08:11 hsivonen

@r12a are you planning on updating the tests?

annevk avatar Apr 25 '18 09:04 annevk

Some time ago, the Encoding Standard started mapping the two bytes "0xA3 0xA0" to "U+3000" rather than U+E5E5 "to be compatible with deployed content".

Do the benefits of this mapping still outweigh the disadvantages even today? The answer depends largely on:

  • how commonly GB18030 (as opposed to other encodings) is used in Web sites today compared to when the mapping was made,
  • whether, today compared to when the mapping was made, the GB18030 bytes "0xA3 0xA0" are still commonly used instead of the proper mapping for U+3000 to represent a space character in Web sites,
  • how frequently Web sites that use "0xA3 0xA0" instead of the proper mapping for U+3000 are seen today compared to when the mapping was made,
  • whether GB18030 encoders/decoders that convert "0xA3 0xA0" to or from U+3000 (besides implementations of the Encoding Standard) are still widely deployed today,
  • the effort required for Encoding Standard implementations to update their GB18030 mapping tables,
  • the need and desirability to comply strictly with GB18030-2005, and
  • other factors weighing for or against this mapping.

peteroupc avatar Jul 08 '18 15:07 peteroupc

Given that Firefox/Chrome/Safari have the same behavior, we'd need data indicating that changing implementations strongly improves compatibility with deployed content.

This is a variation of your 2nd point. If the number of sites using "0xA3 0xA0" intending U+E5E5 has significantly increased, it would be a consideration.

I don't think the other points would be directly relevant to implementors making a decision.

inexorabletash avatar Jul 10 '18 20:07 inexorabletash

Same problem as with the gbk tests, upstreaming never completed: https://github.com/web-platform-tests/wpt/pull/20361.

annevk avatar Oct 17 '18 07:10 annevk