quickcheck
Arbitrary instance for Char and non-ASCII chars
While working on Data.Text, I noticed that the Arbitrary instance for Char just picks a random character from the range (0, 255). Because of that, the tests there failed to catch at least one bug (https://github.com/bos/text/issues/176): the tests are weaker than they seem, since they never exercise the cases where one code point is represented by multiple words.
a) we can fix the quickcheck package
b) we can have a newtype over Char/String just for Text
Does anyone feel strongly about *not* generating non-ASCII Chars?
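For option (b), a minimal sketch of what such a newtype might look like (`UnicodeChar` and `getUnicodeChar` are hypothetical names, not from the text or quickcheck codebases; filtering out surrogates is my assumption about what "valid" should mean here):

```haskell
import Test.QuickCheck
import Data.Char (chr)

-- Hypothetical newtype: generates any valid code point outside the
-- surrogate block U+D800..U+DFFF, instead of only the range (0, 255).
newtype UnicodeChar = UnicodeChar { getUnicodeChar :: Char }
  deriving Show

instance Arbitrary UnicodeChar where
  arbitrary = UnicodeChar . chr <$>
      choose (0x00, 0x10FFFF) `suchThat` nonSurrogate
    where
      nonSurrogate n = n < 0xD800 || n > 0xDFFF
```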
The git version of QuickCheck generates Unicode characters too, see #119.
Having said that, I don't have a feeling for what sort of test data distribution is best when generating Unicode strings (e.g., should we generate surrogate pairs or "funny" characters with higher probability?), so improvements are welcome!
cool! any plans on releasing a new version?
Yes, soon! The main thing I am waiting for is #158, then I plan to make a release as soon as possible.
Actually we don't currently generate surrogate pairs. Maybe that should be fixed!
(Even so, I get test failures running cabal test on text with QuickCheck git, so maybe the problem is wider?)
See the quickcheck-unicode package for The Right Thing To do (TM).
@bos @nick8325
I had no idea of the existence of quickcheck-unicode when I sent #119 last year
Ideally, if there's any change that quickcheck might benefit from in the instances for types present in base, I think it should be merged here.
I'm a bit confused by quickcheck-unicode's README:

> The default Arbitrary instance for the Char type intentionally generates only ASCII values.
I don't know the full history of QuickCheck... is it truly intentional?
Anyhow, I had a look at the implementation:
https://github.com/bos/quickcheck-unicode/blob/master/Test/QuickCheck/Unicode.hs
And, while favoring planes 0 and 1 over planes 2 and 14 sounds interesting, I'm not sure if that's truly the right way:
Unicode is an evolving standard, and quickcheck-unicode has fallen a bit behind (in fact, it seems it hasn't been updated for the newer standard versions). As a quick example, ANATOLIAN HIEROGLYPH A001, code point 0x14400, is part of the standard but doesn't seem to be generated by quickcheck-unicode.
Most programs should be able to accept any Unicode code points, even ones that were not standardized when the program was first compiled; as long as the system's font supports them, the user shouldn't notice.
For this reason, I think it might be better to just generate all valid code points (i.e. everything except surrogates), like the current code does.
The only drawback is that this generates a lot of garbage Chars, but since most of the non-BMP planes lack wide font support anyway, there wouldn't be much difference compared with explicitly allowing only some character groups from those planes. And to make things less unreadable, we're already skewing the frequency in favor of ASCII.
I think that for "friendlier" Unicode generation, the PrintableString instance already does a good job:
https://github.com/nick8325/quickcheck/blob/ecc3650/Test/QuickCheck/Arbitrary.hs#L1082
OTOH, that eventually ends up relying on iswprint, and a case could be made for reducing the dependency on libc, I guess?
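The ASCII-skewed full-range generation mentioned above could be sketched roughly like this (`unicodeChar` is a made-up name, and the 3:1 weighting is only illustrative, not QuickCheck's actual distribution):

```haskell
import Test.QuickCheck
import Data.Char (chr)

-- Illustrative generator: mostly ASCII, occasionally any valid
-- non-surrogate code point up to U+10FFFF.
unicodeChar :: Gen Char
unicodeChar = frequency
  [ (3, chr <$> choose (0x00, 0x7F))                    -- plain ASCII
  , (1, chr <$> choose (0x00, 0x10FFFF) `suchThat` ok)  -- full range
  ]
  where
    ok n = n < 0xD800 || n > 0xDFFF  -- skip the surrogate block
```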
Thank you for all your work :)
@nick8325
> Actually we don't currently generate surrogate pairs. Maybe that should be fixed!
That's intentional: [Char] is sort-of equivalent to a UTF-32 encoding, and strict UTF-32 and UTF-8 encoders/decoders should reject surrogate pairs, even if they are valid pairs.
http://unicodebook.readthedocs.io/unicode_encodings.html
> An UTF-8 or UTF-32 encoder should not encode surrogate characters (U+D800—U+DFFF), see Non-strict UTF-8 decoder.
http://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder
In fact, Data.Text.pack of chr 0x10330 will work just fine, but pack [chr 0xd800, chr 0xdf30] (which is the surrogate pair representation of 0x10330) will just yield a couple of 0xfffd replacement characters, which then prevents round-tripping it back via unpack.
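The lossy round trip described above can be checked directly; a small sketch, assuming the text package is available:

```haskell
import qualified Data.Text as T
import Data.Char (chr)

main :: IO ()
main = do
  -- A non-BMP scalar value round-trips through pack/unpack...
  print (T.unpack (T.pack [chr 0x10330]) == [chr 0x10330])
  -- ...but its surrogate-pair representation does not: each surrogate
  -- is replaced with U+FFFD, so unpack cannot recover the original.
  print (T.unpack (T.pack [chr 0xD800, chr 0xDF30]) == [chr 0xFFFD, chr 0xFFFD])
```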
There might still be value in generating valid surrogate pairs, to help test non-strict decoders (but there would be some significant added complexity in ensuring that the encoding is otherwise valid, and for something like that, the AllUnicode newtype could've been enough).