GBK encoding/decoding support

Open r12a opened this issue 9 years ago • 13 comments

Results for a series of tests for GBK encoding/decoding can be found at https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gbk

The tests can be run from that page (select the link in the left-most column), or you can get them from the WPT repo. There is a PR at https://github.com/w3c/web-platform-tests/pull/3194

The tests check whether:

  1. the browser produces the expected byte sequences for all characters in the gbk encoding after 0x9F when encoding bytes for a URL produced by a form, using the encoder steps in the specification.
  2. the browser produces percent-escaped character references for a URL produced by a form when encoding miscellaneous characters that are not in the gbk encoding (tests for several ranges).
  3. same two types of test when writing characters to an href value
  4. the browser decodes all characters as expected from a file generated by encoding all pointers in the gbk encoding per the encoder steps in the specification.
  5. when decoding gbk text, the browser uses replacement characters as described by the algorithm in the Encoding spec.
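
As a rough illustration of checks 1 and 2, the sketch below approximates form-style encoding in Python. It rests on several assumptions: `form_encode` is a hypothetical helper, Python's built-in `gbk` codec is not guaranteed to match the Encoding Standard's gbk index in every mapping, and `quote_from_bytes` escapes a slightly different set of bytes than a browser's form serializer. It only shows the general shape: encodable characters become percent-escaped gbk bytes, and unencodable ones become percent-escaped numeric character references.

```python
from urllib.parse import quote_from_bytes


def form_encode(text: str, encoding: str = "gbk") -> str:
    """Hypothetical helper: approximate how a form submission encodes text."""
    # xmlcharrefreplace turns characters the codec cannot encode into
    # decimal numeric character references such as "&#128512;".
    raw = text.encode(encoding, errors="xmlcharrefreplace")
    # Percent-escape every byte outside the unreserved ASCII set.
    return quote_from_bytes(raw, safe="")


print(form_encode("中"))   # a character that gbk can encode
print(form_encode("😀"))  # a character outside gbk
```

With Python's codec, the first call yields the two gbk bytes as percent-escapes and the second yields the escaped form of `&#128512;`.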

The following summarises the current situation according to my testing, for major desktop browsers. (I will be adding nightly results and perhaps other browsers in time.) The table lists the number of characters that were NOT successfully converted by the test.

[Screenshot of results table, 2016-06-20 14:53:58]

Notes:

  • all href tests fail for Edge because characters are not converted to percent-escapes
  • Firefox consistently fails to produce expected results for href tests for characters not in the gbk encoding

Can we please investigate the failures to ascertain whether:

  1. the browser needs to be changed
  2. the spec needs to be changed
  3. the test is at fault

The following tool may be helpful for investigating issues. It converts between byte sequences and characters for all encodings in the Encoding spec. http://r12a.github.io/apps/encodings/

r12a avatar Jun 20 '16 14:06 r12a

The href failures are somewhat expected. Firefox doesn't use the same strategy as <form> there yet.

We should probably investigate the smaller number of failures from Chrome/Safari/Edge in encode/decode.

annevk avatar Jun 20 '16 14:06 annevk

List of bugs raised:

  • https://bugs.webkit.org/show_bug.cgi?id=159892
  • https://bugzilla.mozilla.org/show_bug.cgi?id=1285400
  • https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/8202282/
  • https://bugs.chromium.org/p/chromium/issues/detail?id=626406

r12a avatar Sep 15 '16 17:09 r12a

Chromium's failure is expected. See http://crbug.com/339862 and http://crbug.com/430823

jungshik avatar Sep 16 '16 06:09 jungshik

@r12a this is the first in a series of issues you raised about tests. I'm not sure how I can address these from a standards-perspective. What would it take to get them closed?

annevk avatar Nov 16 '16 08:11 annevk

Yeah, these are a little unusual. I'm not sure (other than, of course, all implementations passing all tests).

We raised them here (a) so that they would be noticed and we had something to point to where people could hold a discussion if they wanted, although much of the discussion is actually taking place in the browser bugs raised, and (b) so that we had a central location pointing to, and perhaps from time to time summarising, the implementation work/issues, so that people are notified of movement without having to subscribe to all the many bugs raised. (For example, I've been meaning to point to https://bugs.webkit.org/show_bug.cgi?id=159891, which seems to need suggestions on how to move forward.)

If you feel you want to close them we could do so, but perhaps we could add a comment saying that people can still contribute to the discussion while closed(?)

r12a avatar Nov 16 '16 15:11 r12a

I guess I'll leave them all open for now then. Not in a rush.

annevk avatar Nov 16 '16 17:11 annevk

AFAICT, the gbk tests differ from the spec for one code point to byte pair and vice versa mapping: The tests want A8 BC to decode to U+E7C7 and want U+E7C7 to encode to A8 BC. However, per spec, A8 BC maps to U+1E3F.
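
To make the A8 BC case easier to look up in the index, here is a minimal sketch of the two-byte pointer arithmetic, written from the formula in the Encoding Standard's gb18030 decoder and encoder steps as I understand them. The function names `gbk_pointer` and `gbk_bytes` are hypothetical, and the sketch deliberately omits the index lookup itself, so it says nothing about which code point a pointer maps to.

```python
def gbk_pointer(lead: int, trail: int) -> int:
    """Hypothetical helper: compute the index pointer for a two-byte sequence."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    # Trail bytes skip 0x7F, so the offset depends on which side we are on.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def gbk_bytes(pointer: int) -> bytes:
    """Hypothetical helper: inverse of gbk_pointer for two-byte pointers."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead, trail + offset])


pointer = gbk_pointer(0xA8, 0xBC)
print(pointer)                            # 7533
print(gbk_bytes(pointer).hex().upper())   # A8BC
```

The pointer printed here is the row to check in the gb18030 index when comparing the test's expectation with the spec's mapping.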

hsivonen avatar Apr 27 '17 15:04 hsivonen

Additionally, the tests seem to disagree with the spec on the handling of ASCII bytes as part of a malformed sequence when decoding:

  • Fail step 2: 82 30 C3 assert_equals: expected "�" but got "�0�"
  • Fail step 5.7: 82 FF C3 33 assert_equals: expected "��" but got "��3"
  • Fail step 9: FF 30 C3 33 assert_equals: expected "�0�" but got "�0�3"

hsivonen avatar Apr 27 '17 15:04 hsivonen

Firefox Nightly 56 improved the encoder, but it still has one test failure. The decoder apparently has not improved.

vyv03354 avatar Jun 14 '17 13:06 vyv03354

Today and yesterday I updated the results at https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en#gbk for Firefox, Firefox Nightly, Chrome, and Canary. The latest summary is:

[Screenshot of results table, 2017-06-15 08:43:58]

(There are columns for nightlies only where the results differ from the released versions.)

r12a avatar Jun 15 '17 07:06 r12a

Firefox's one failure is this:

U+E7C7  %A8%BC assert_equals: expected "%A8%BC" but got "%26%23%35%39%33%33%35%3B"

could be an issue with the test(?)
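
For reading the "got" value in that failure, a small sketch using only the Python standard library: it undoes the percent-escapes and then reads the decimal number out of the resulting numeric character reference. The variable names are illustrative only.

```python
import re
from urllib.parse import unquote

got = "%26%23%35%39%33%33%35%3B"

# Undo the percent-escapes; every escaped byte here is plain ASCII.
decoded = unquote(got)

# The result is a decimal numeric character reference of the form &#N;.
match = re.fullmatch(r"&#(\d+);", decoded)
code_point = int(match.group(1))

print(decoded)                 # &#59335;
print(f"U+{code_point:04X}")   # U+E7C7
```

So the value Firefox produced is the escaped character reference for U+E7C7, which is the output shape used for a character that the encoder treats as not being in the encoding.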

r12a avatar Jun 15 '17 07:06 r12a

> Firefox's one failure is this:
>
> U+E7C7  %A8%BC assert_equals: expected "%A8%BC" but got "%26%23%35%39%33%33%35%3B"
>
> could be an issue with the test(?)

That's indeed a test bug. See upthread.

hsivonen avatar Jun 15 '17 10:06 hsivonen

Upstreaming these tests never completed. There's https://github.com/web-platform-tests/wpt/pull/20360 but it needs work.

annevk avatar Oct 17 '18 07:10 annevk