Add `charset=` argument to restrict `st.characters()` to codepoints representable in that encoding
Reflecting on chardet/chardet#167 made me realise that it may often be useful to restrict generated characters to those valid for a particular encoding. In fact, that's exactly why we exclude surrogate characters from the default `text()` strategy - because they're invalid in utf-8!
This is obviously going to be much more efficient by construction than by filtering, and not too difficult - conceptually it's just an addition to the `blacklist_characters` argument, and can be implemented as such (though with some more attention to error messages for invalidly intersecting arguments, and some memoisation for performance).
Finally, the eternal question: what should we call this argument? I think the best option is `codec=None`, with an error message that asks for a codec name if we get a non-string value.
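For concreteness, a hypothetical usage of the proposed argument might look like this (the argument name and exact semantics are still under discussion here, so treat it as a sketch):

```python
from hypothesis import given, strategies as st

# Hypothetical: restrict generated text to characters that iso-8859-3
# can encode, so every generated string round-trips through the codec.
@given(st.text(alphabet=st.characters(codec="iso-8859-3")))
def test_round_trips_through_latin3(s):
    assert s.encode("iso-8859-3").decode("iso-8859-3") == s
```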
For many encodings, this can be done using `st.binary()` and decoding?
```python
from hypothesis import given, reject, strategies as some

@given(some.binary())
def test_latin3_specifically(encoded):
    # Discard any byte string that isn't valid iso-8859-3; note that
    # bytes.decode raises UnicodeDecodeError, a subclass of ValueError.
    try:
        text = encoded.decode('iso-8859-3')
    except ValueError:
        reject()
    ...
```
With encodings like utf-8 the percentage rejected could get high, but for the likes of iso-8859-* where most bytes map to characters and there aren't any multi-byte sequences, it should work fine?
It does work for some encodings, but rejection sampling is almost always less efficient than getting it right by construction. It also skews the distribution of sizes - because long strings are more likely to include a non-decodable sequence, this will be biased towards small strings and thus less effective at finding bugs :confused:
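To illustrate the by-construction approach, here is a rough sketch - the helper names are invented, and scanning the whole codepoint range like this is too slow for anything but a cached one-off, which is exactly why the proposal mentions memoisation:

```python
import sys
from functools import lru_cache
from hypothesis import strategies as st

@lru_cache(maxsize=None)
def _encodable_codepoints(codec):
    # Hypothetical helper: collect every codepoint the codec can encode.
    # A real implementation would memoise this more aggressively, since
    # scanning all ~1.1 million codepoints takes seconds per codec.
    out = []
    for cp in range(sys.maxunicode + 1):
        try:
            chr(cp).encode(codec)
        except UnicodeError:
            continue
        out.append(cp)
    return tuple(out)

def characters_for(codec):
    # Sample directly from the encodable codepoints, so no generated
    # character is ever rejected and string sizes aren't skewed.
    return st.sampled_from(_encodable_codepoints(codec)).map(chr)
```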
In addition to `charset`, perhaps using Unicode Scripts would be a useful feature here?
Sounds interesting, but I don't think Hypothesis is the right place to put a policy on which codepoints should be generated for a given script - at least initially - and `st.characters()` and `st.text()` are pretty tightly tied to codepoints. For example, should "uncoded" be included for a particular script? What about "inherited"?
If a third-party extension wanted to work out that mapping and demonstrated that people found it useful, I'd be happy to merge it back in later! Just cautious about making opinionated decisions under our backwards-compatibility constraints without first exploring some concrete use-cases 🙂
Thanks for the heads-up, @Zac-HD. It won’t happen this week or next, but I’ll look into finding (or creating) a dataset that we can use to map scripts to codepoints, and then wrap a `st.characters()` around it. Something something… 🤔
After some poking around, I think a first stab at building such a strategy would be to sort out the data. Unfortunately, Python’s own `unicodedata` module gives access to only limited character properties. However, looking at the Unicode Character Database (UCD) v14, we have `ucd/Scripts.txt`, which correlates codepoints and codepoint ranges with Script, Category, and actual Name.
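A minimal parsing sketch, assuming `Scripts.txt` has been downloaded locally - the file lists single codepoints or `lo..hi` ranges, a semicolon, then the script name:

```python
import re

# Matches lines like:
# "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(path="Scripts.txt"):
    # Build {script_name: [(lo, hi), ...]}; comment lines don't match.
    ranges = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                lo = int(m.group(1), 16)
                hi = int(m.group(2) or m.group(1), 16)
                ranges.setdefault(m.group(3), []).append((lo, hi))
    return ranges
```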
No worries - we already vendor `tlds-alpha-by-domain.txt`, so all the machinery (including auto-updates, which should be rare for Unicode!) is in place 🙂
It occurs to me that the most common use-cases will be `codec="ascii"`, `codec="utf-8"`, and `codec=None` (i.e. no restriction). These are also pretty easy to implement, so we might support those and come back for other codecs later.
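A minimal sketch of those special cases, using a hypothetical dispatch helper rather than Hypothesis's actual API:

```python
from hypothesis import strategies as st

def characters_for_codec(codec=None):
    # Hypothetical helper covering just the easy cases named above.
    if codec is None:
        return st.characters()  # no restriction beyond the defaults
    if codec == "ascii":
        return st.characters(max_codepoint=0x7F)
    if codec == "utf-8":
        # The default already excludes surrogates (category Cs), which
        # are the only codepoints utf-8 cannot encode.
        return st.characters(blacklist_categories=("Cs",))
    raise NotImplementedError(f"codec={codec!r} is not supported yet")
```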
May I give this a try, @Zac-HD?
Certainly!
I'm heading out on a two-week camping trip tomorrow, so don't worry when it takes me a long time to respond or review 😅
Enjoy @Zac-HD!
@Cheukting I’m also interested in this, so please let me know if there’s anything I can contribute.
@jenstroeger @Zac-HD Hey, sorry it's been quiet for a while - I started a new job last week and have been a bit busy. I am diving back into this and will let you know if I have any updates :-)
Best wishes for the new job, and no worries - the issue is almost five years old, not especially urgent 😉