fitnesse icon indicating copy to clipboard operation
fitnesse copied to clipboard

Fix incorrect character count in SlimSerializer with surrogate pairs.

Open epistax opened this issue 2 years ago • 11 comments

When serializing Slim streams, count surrogate pairs as a single character in character counts.

This prevents a crash in CSlim that occurs when the number of characters specified exceeds the number of characters present.

epistax avatar Mar 30 '23 20:03 epistax

I'm wondering how other Slim server implementations (e.g. Java, .NET) handle this situation. Will this be a breaking change for them?

jediwhale avatar Apr 03 '23 15:04 jediwhale

I'm wondering how other Slim server implementations (e.g. Java, .NET) handle this situation. Will this be a breaking change for them?

Completely fair question. I don't know if they've relied on this existing behavior or not. It seems clear no one tried using surrogate pairs with CSlim before but maybe they are more popular on other implementations.

A different option could be to define "character" as a UTF-16 "element" in the Slim documentation and change CSlim. It's less preferred for me (because then I have more work to do!) but doable.

epistax avatar Apr 03 '23 15:04 epistax

I would be happy to continue to maintain this patch if I could know what direction we want to take this. Right now either the code or the documentation really ought to be changed, and Slim implementation library projects would want to know in either case!

Thanks

epistax avatar May 03 '23 13:05 epistax

I have to admit I'm not familiar enough with unicode an surrogate pairs to see what the impact might be. I consider it quite likely no one ever tried to use surrogate pair with FitNesse before, and no tests break with your change. So in principle I would consider it mergeable. But I would like to see actual acceptance tests use surrogate pairs besides the unit test you now added.

It should actually be fairly straightforward to have a test in the project to check how the Java implementation of Slim will deal with this I suppose. Just have a test that send such pairs from the wiki to a Slim example fixture and see they are received properly in the Java class implementing the fixture. And have another method where the fixture emits a such a surrogate pair as part of a method result and ensure it ends up in the wiki correctly.

We already have sample fixtures for Java in the project it doesn't sound hard. But (not sounding hard and not taking a lot of time can be two quite different things I learned and) I personally don't have the time to set this up.

@jediwhale I would assume you have some insight into the impact on the .Net Slim server...

fhoeben avatar May 03 '23 14:05 fhoeben

I'll test with .NET

jediwhale avatar May 03 '23 14:05 jediwhale

This is a breaking change for .NET. The .NET Slim implementation assumes the length prefix for a string is the number returned by the string Length property, which is the number of UTF-16 characters, not the number of Unicode code points. e.g. the following test passes because the second character in the string is stored as a surrogate pair of UTF-16 characters. Assert.AreEqual(6, "h🀜llo".Length); I suspect the same is true in Java but I haven't tested it yet there. If so, the documentation is incorrect and should be changed from "the number of characters in the string" to "the number of UTF-16 characters in the string"

jediwhale avatar May 04 '23 23:05 jediwhale

This is a breaking change for .NET. The .NET Slim implementation assumes the length prefix for a string is the number returned by the string Length property, which is the number of UTF-16 characters, not the number of Unicode code points. e.g. the following test passes because the second character in the string is stored as a surrogate pair of UTF-16 characters. Assert.AreEqual(6, "h🀜llo".Length); I suspect the same is true in Java but I haven't tested it yet there. If so, the documentation is incorrect and should be changed from "the number of characters in the string" to "the number of UTF-16 characters in the string"

Wouldn't a 4-byte UTF-16 character have two code units but still only be a single UTF-16 character? Per the javadoc: "Returns the length of this string. The length is equal to the number of Unicode code units in the string.". Indeed this was the root of our problem because on the cslim side we were interpreting this number as the number of characters rather than the number of code units.

Meanwhile I believe a character is something that maps to a grapheme. And it a surrogate pair it takes more than a single UTF-16 code unit to get there.

At any rate, it sounds like I should change cslim to expect the number mean the number of code units in the string!

epistax avatar May 05 '23 13:05 epistax

Yes, you're right - I should have said code units, not characters!

jediwhale avatar May 05 '23 15:05 jediwhale

I guess the weird thing about this is that we're saying that the UTF-8 slim protocol needs to be aware of the UTF-16 interpretation of the data. So to be complaint we may be need to go from UTF-8 on either/both sides of the communication to UTF-16 using a library like icu or iconv to see what the UTF-16 representation would be, and report on that for the size of the text fields? Err maybe I'm a little turned around here, but that sounds like what we're saying?

I'm now wondering why this isn't just number of bytes.

epistax avatar May 05 '23 20:05 epistax

I agree it's not the best protocol. It is very Java-centric, based on how Java handles strings. Other platforms have to try and duplicate what Java is doing. For .NET, it's not too bad, because Microsoft copied what Java does. For others, not so simple.

jediwhale avatar May 05 '23 21:05 jediwhale

Updated documentation and added tests #1433

jediwhale avatar May 05 '23 21:05 jediwhale