quickcheck icon indicating copy to clipboard operation
quickcheck copied to clipboard

Arbitrary instance for Char and non-ascii chars

Open kuk0 opened this issue 7 years ago • 7 comments

while working on Data.Text i noticed that Arbitrary instance for Char just picks a random character from the range (0,255) because of that, the tests there failed to catch at least 1 bug (https://github.com/bos/text/issues/176) the tests are weaker than they seem since they don't test the cases where 1 code point is represented by multiple words

a) we can fix the quickcheck package b) we can have a newtype over Char/String just for Text

does anyone feel strongly about //not// generating non-ascii Chars?

kuk0 avatar May 03 '17 20:05 kuk0

The git version of QuickCheck generates Unicode characters too, see #119.

Having said that, I don't have a feeling for what sort of test data distribution is best when generating Unicode strings (e.g., should we generate surrogate pairs or "funny" characters with higher probability?), so improvements are welcome!

nick8325 avatar May 03 '17 21:05 nick8325

cool! any plans on releasing a new version?

kuk0 avatar May 04 '17 12:05 kuk0

Yes, soon! The main thing I am waiting for is #158, then I plan to make a release as soon as possible.

nick8325 avatar May 04 '17 13:05 nick8325

Actually we don't currently generate surrogate pairs. Maybe that should be fixed!

(Even so, I get test failures running cabal test on text with QuickCheck git, so maybe the problem is wider?)

nick8325 avatar May 04 '17 15:05 nick8325

See the quickcheck-unicode package for The Right Thing To do (TM).

bos avatar May 21 '17 05:05 bos

@bos @nick8325

I had no idea of the existence of quickcheck-unicode when I sent #119 last year

Ideally, if there's any change that quickcheck might benefit from, in the instances for types present in base, I think that it should be merged here.

I'm a bit confused by quickcheck-unicode's README:

The default Arbitrary instance for the Char type intentionally generates only ASCII values.

I don't know the full history of Quickcheck... is it truly intentional?

Anyhow, I had a look at the implementation:

https://github.com/bos/quickcheck-unicode/blob/master/Test/QuickCheck/Unicode.hs

And, while favoring planes 0 and 1 over planes 2 and 14 sounds interesting, I'm not sure if that's truly the right way:

Unicode is an evolving standard, and quickcheck-unicode fell a bit behind (in fact, it seems that it hasn't been updated for the newer standard versions)... as a quick example ANATOLIAN HIEROGLYPH A001, codepoint 0x14400 is part of the standard, but doesn't seem to be generated by quickcheck-unicode

Most programs shouild be able to accept any unicode codepoints, even if they were not standardized when the program was first compiled, as long as the system's font supports them, the user shouldn't notice.

For this reason, I think that it might be better to just generate all valid (non invalid surrogate pairs) codepoints, like the current code is doing.

The only drawback would be that this way it generates a lot of garbage Chars, but since anyway most stuff in the non-BMP planes does not have wide fonts support, there wouldn't be much difference with explicitly allowing only some characters groups from the planes. And to make things less unreadable we're already skewing the frequency in favor of ASCII.

I think that for "friendlier" unicode generation, the PrintableString instance already does a good job:

https://github.com/nick8325/quickcheck/blob/ecc3650/Test/QuickCheck/Arbitrary.hs#L1082

OTOH, that eventually ends up relying on iswprint, and a case could be made for reducing dependency on the libc, I guess?

Thank you for all your work :)

berdario avatar May 21 '17 20:05 berdario

@nick8325

Actually we don't currently generate surrogate pairs. Maybe that should be fixed!

That's intentional, [Char] is sort-of equivalent to a UTF-32 encoding and strict UTF-32 and UTF-8 encoders/decoders should reject surrogate pairs, even if they are valid pairs

http://unicodebook.readthedocs.io/unicode_encodings.html

An UTF-8 or UTF-32 encoder should not encode surrogate characters (U+D800—U+DFFF), see Non-strict UTF-8 decoder.

http://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder

in fact, Data.Text.pack of chr 0x10330 will work just fine, but pack [chr 0xd800, chr 0xdf30] (which is the surrogate pair representation of 0x10330) will just yield a couple of 0xfffd replacement characters, which would then prevent to round-trip it back via unpack.

There might still be value in generating valid surrogate pairs, to help test non-strict decoders (but there would be some significant added complexity in ensuring the the encoding is otherwise valid, and for something like that, the AllUnicode newtype could've been enough)

berdario avatar May 21 '17 20:05 berdario