Add `charset=` argument to restrict `st.characters()` to codepoints representable in that encoding
Reflecting on chardet/chardet#167 made me realise that it may often be useful to restrict generated characters to those valid for a particular encoding. In fact, that's exactly why we exclude surrogate characters from the default `text()` strategy - because they're invalid in utf-8!
This is obviously going to be much more efficient by construction than by filtering, and not too difficult - conceptually it's just an addition to the `blacklist_characters` argument, and can be implemented as such (though with some more attention to error messages for invalidly intersecting arguments, and some memoisation for performance).
Finally, the eternal question: what should we call this argument? I think the best option is `codec=None`, with an error message that asks for a codec name if we get a non-string value.
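For concreteness, a hypothetical usage of the proposed argument might look like this (the argument name and exact semantics are still under discussion here, so treat it as a sketch):

```python
from hypothesis import given, strategies as st

# Hypothetical: restrict generated text to characters that iso-8859-3
# can encode, so every generated string round-trips through the codec.
@given(st.text(alphabet=st.characters(codec="iso-8859-3")))
def test_round_trips_through_latin3(s):
    assert s.encode("iso-8859-3").decode("iso-8859-3") == s
```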
For many encodings, this can be done using `st.binary()` and decoding?
```python
from hypothesis import given, reject, strategies as some

@given(some.binary())
def test_latin3_specifically(encoded):
    # Discard any byte string that isn't valid iso-8859-3; note that
    # bytes.decode raises UnicodeDecodeError, a subclass of ValueError.
    try:
        text = encoded.decode('iso-8859-3')
    except ValueError:
        reject()
    ...
```
With encodings like utf-8 the percentage rejected could get high, but for the likes of iso-8859-* where most bytes map to characters and there aren't any multi-byte sequences, it should work fine?
It does work for some encodings, but rejection sampling is almost always less efficient than getting it right by construction. It also skews the distribution of sizes - because long strings are more likely to include a non-decodable sequence, this will be biased towards small strings and thus less effective at finding bugs :confused:
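To illustrate the by-construction approach, here is a rough sketch - the helper names are invented, and scanning the whole codepoint range like this is too slow for anything but a cached one-off, which is exactly why the proposal mentions memoisation:

```python
import sys
from functools import lru_cache
from hypothesis import strategies as st

@lru_cache(maxsize=None)
def _encodable_codepoints(codec):
    # Hypothetical helper: collect every codepoint the codec can encode.
    # A real implementation would memoise this more aggressively, since
    # scanning all ~1.1 million codepoints takes seconds per codec.
    out = []
    for cp in range(sys.maxunicode + 1):
        try:
            chr(cp).encode(codec)
        except UnicodeError:
            continue
        out.append(cp)
    return tuple(out)

def characters_for(codec):
    # Sample directly from the encodable codepoints, so no generated
    # character is ever rejected and string sizes aren't skewed.
    return st.sampled_from(_encodable_codepoints(codec)).map(chr)
```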
In addition to `charset`, perhaps using Unicode Scripts would be a useful feature here?
Sounds interesting, but I don't think Hypothesis is the right place to put a policy on which codepoints should be generated for a given script - at least initially - and `st.characters()` and `st.text()` are pretty tightly tied to codepoints. For example, should "uncoded" be included for a particular script? What about "inherited"?
If a third-party extension wanted to work out that mapping and demonstrated that people found it useful, I'd be happy to merge it back in later! Just cautious about making opinionated decisions under our backwards-compatibility constraints without first exploring some concrete use-cases 🙂
Thanks for the heads-up, @Zac-HD. It won’t happen this week or next, but I’ll look into finding (or creating) a dataset that we can use to map scripts to codepoints, and then wrap a `st.characters()` around it. Something something… 🤔
After some poking around, I think a first stab at building such a strategy would be to sort out the data. Unfortunately, Python’s own `unicodedata` module gives access to only limited character properties. However, looking at the Unicode Character Database (UCD) v14, we have `ucd/Scripts.txt`, which correlates codepoints and codepoint ranges with Script, Category, and actual Name.
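A minimal parsing sketch, assuming `Scripts.txt` has been downloaded locally - the file lists single codepoints or `lo..hi` ranges, a semicolon, then the script name:

```python
import re

# Matches lines like:
# "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(path="Scripts.txt"):
    # Build {script_name: [(lo, hi), ...]}; comment lines don't match.
    ranges = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                lo = int(m.group(1), 16)
                hi = int(m.group(2) or m.group(1), 16)
                ranges.setdefault(m.group(3), []).append((lo, hi))
    return ranges
```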
No worries - we already vendor `tlds-alpha-by-domain.txt`, so all the machinery (including auto-updates, which should be rare for Unicode!) is in place 🙂
It occurs to me that the most common use-cases will be `codec="ascii"`, `codec="utf-8"`, and `codec=None` (i.e. no restriction). These are also pretty easy to implement, so we might support those and come back for other codecs later.
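A minimal sketch of those special cases, using a hypothetical dispatch helper rather than Hypothesis's actual API:

```python
from hypothesis import strategies as st

def characters_for_codec(codec=None):
    # Hypothetical helper covering just the easy cases named above.
    if codec is None:
        return st.characters()  # no restriction beyond the defaults
    if codec == "ascii":
        return st.characters(max_codepoint=0x7F)
    if codec == "utf-8":
        # The default already excludes surrogates (category Cs), which
        # are the only codepoints utf-8 cannot encode.
        return st.characters(blacklist_categories=("Cs",))
    raise NotImplementedError(f"codec={codec!r} is not supported yet")
```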
May I give this a try, @Zac-HD?
Certainly!
I'm heading out on a two-week camping trip tomorrow, so don't worry when it takes me a long time to respond or review 😅
Enjoy @Zac-HD!
@Cheukting I’m also interested in this, so please let me know if there’s anything I can contribute.
@jenstroeger @Zac-HD Hey, sorry it's been quiet for a while - I started a new job last week and have been a bit busy. I am diving back into this and will let you know if I have any updates :-)
Best wishes for the new job, and no worries - the issue is almost five years old, not especially urgent 😉