QuickTheories
String DSL should support valid UTF-8
I find that if I use basicMultilingualPlaneAlphabet from the string DSL, I get back invalid UTF-8. To generate a valid UTF-8 gen, I have the following in my code:
public static final Gen<String> UTF_8_GEN =
    SourceDSL.strings()
        .basicMultilingualPlaneAlphabet()
        .ofLengthBetween(0, 1024)
        .map(s -> new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
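This conversion to bytes and back gets rid of all invalid code points (Java substitutes a "?" for each one it cannot encode), so the resulting string is always valid UTF-8.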
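To make the effect of that round trip concrete, here is a minimal sketch using only the JDK. The class name RoundTripDemo and the sample strings are made up for illustration; the only behaviour relied on is that String.getBytes(Charset) substitutes the charset's replacement bytes for input it cannot encode.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode to UTF-8 bytes and decode again, mirroring the map step of the generator.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "h\u00e9llo";        // every char is a complete code point
        String loneSurrogate = "a\uD800b";  // U+D800 with no matching low surrogate

        System.out.println(roundTrip(valid).equals(valid));                 // true
        System.out.println(roundTrip(loneSurrogate).equals(loneSurrogate)); // false
        System.out.println(roundTrip(loneSurrogate));                       // a?b
    }
}
```

So the mapped generator never fails, but it can hand back a string that differs from the one the underlying generator produced.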
I'm not quite sure what you mean by "I get back invalid UTF-8". As far as I understand it, Java uses UTF-16 to encode strings internally, and you would specify a charset only when translating to bytes.
Sorry for not replying for a long time.
I have a lot of use cases which deal with serialization, so I want to make sure UTF-8 strings are serialized and deserialized without loss; there is a large assumption in most of my code that the original string is valid UTF-8. What I find when I use the basicMultilingualPlaneAlphabet generator without the conversion is that deserializing the string returns a different value, so the two strings are no longer .equals(o).
Looking up the UTF-8 code points, I see the max defined UTF-8 value is 99k but StringDSL defines 65k. I could totally be reading everything wrong (I use UTF-8, I don't know the spec at all =D), but that would imply to me that I should always get back valid UTF-8 chars; yet for some reason the string comes back as invalid UTF-8 and the Charset will drop some chars.
My common use case is to deal with UTF-8 strings, so I tend to define the generator above in every project.
I dug a bit deeper into the problem and checked for which code points the round-trip conversion does not produce the same chars. The smallest one I found was 0xD800, which is the beginning of an area where Unicode currently has no defined characters (see https://unicode-table.com). So the phenomenon will be the same when using UTF-16, for example.
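For anyone who wants to repeat that check, a rough sketch of such a scan over the basic multilingual plane could look like this. The class and method names are made up for illustration, and it only uses the JDK, not QuickTheories. The failing block it finds is the surrogate range, U+D800 to U+DFFF, which is reserved for UTF-16 surrogate pairs and cannot be encoded on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BmpRoundTripScan {
    // Collect every single-char BMP value that changes when encoded to UTF-8 and decoded again.
    static List<Integer> failingCodePoints() {
        List<Integer> failing = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            String original = String.valueOf((char) cp);
            String decoded = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
            if (!decoded.equals(original)) {
                failing.add(cp);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        List<Integer> failing = failingCodePoints();
        System.out.println(String.format("first: 0x%04X", failing.get(0)));                  // first: 0xD800
        System.out.println(String.format("last:  0x%04X", failing.get(failing.size() - 1))); // last:  0xDFFF
        System.out.println("count: " + failing.size());                                      // count: 2048
    }
}
```

Every other value in the BMP survives the round trip when it appears on its own, so the mismatch comes entirely from unpaired surrogates.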
So, maybe a better approach than providing a specialised UTF-8 generator could be to (optionally) filter out all code points that have no defined character in Unicode, for example like this:
SourceDSL.strings()
    .basicMultilingualPlaneAlphabet()
    .ofLengthBetween(0, 1024)
    .acceptOnlyValid("utf-8")
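Note that acceptOnlyValid is only a proposed name here, not an existing method. As a rough sketch of the check such a filter might perform, the predicate below uses the JDK's CharsetEncoder.canEncode, which reports false for strings containing unpaired surrogates. The class EncodableStrings and the method encodableIn are hypothetical, and how the predicate would be wired into the string DSL is left open.

```java
import java.nio.charset.Charset;
import java.util.function.Predicate;

class EncodableStrings {
    // Hypothetical helper: builds a predicate that accepts only strings the
    // named charset can encode without replacing anything.
    static Predicate<String> encodableIn(String charsetName) {
        Charset charset = Charset.forName(charsetName);
        // Encoders are stateful and not thread-safe, so create a fresh one per check.
        return s -> charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        Predicate<String> validUtf8 = encodableIn("utf-8");
        System.out.println(validUtf8.test("hello"));         // true
        System.out.println(validUtf8.test("\uD83D\uDE00"));  // true: a complete surrogate pair
        System.out.println(validUtf8.test("a\uD800b"));      // false: unpaired high surrogate
    }
}
```

Filtering like this rejects whole strings rather than repairing them, so generated values that pass are returned unchanged, unlike the map-based round trip.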