Hypothesis strategies in xarray.testing.strategies
Adds a whole suite of hypothesis strategies for generating xarray objects, inspired by and separated out from the new hypothesis strategies in #4972. They are placed into the namespace `xarray.testing.strategies`, and publicly mentioned in the API docs, but with a big warning message. There is also a new `testing` page in the user guide documenting how to use these strategies.
- [x] Closes #6911
- [x] Tests added
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [x] New functions/methods are listed in `api.rst`
EDIT: A `variables` strategy and user-facing documentation were shipped in https://github.com/pydata/xarray/pull/8404
I also added my chunking strategy from https://github.com/HypothesisWorks/hypothesis/issues/3433
@Zac-HD would you be able to give your hypothesis-expert opinions on the general design of these strategies? Especially the whole "strategies that accept multiple other strategies" thing.
> @Zac-HD would you be able to give your hypothesis-expert opinions on the general design of these strategies? Especially the whole "strategies that accept multiple other strategies" thing.
I'll aim for a proper review tonight! Quick remarks: strategies accepting strategies is fine, though our API style guide suggests accepting strategies xor values (this is violated in a few places in our Numpy extra, but it's still a good base principle). The guides might have other useful advice; I'd recommend skimming them, since you're planning to ship Hypothesis-extending code to quite a few people!
> I'll aim for a proper review tonight!
Amazing! Thank you!
> Quick remarks: strategies accepting strategies is fine, though our API style guide suggests accepting strategies xor values
Oh I hadn't seen that! Looks like I independently made similar API decisions though, as far as I can tell.
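For concreteness, one way the "strategies xor values" guideline can show up in practice is validating that strategy-typed parameters really receive strategies, with a pointer to the right wrapper otherwise. A minimal stdlib sketch, where `SearchStrategy` and `Just` are stand-ins for Hypothesis's classes and `names` is a hypothetical strategy function (not xarray's actual API):

```python
class SearchStrategy:
    """Minimal stand-in for hypothesis.strategies.SearchStrategy."""

class Just(SearchStrategy):
    """Stand-in for st.just(): a strategy that always draws one value."""
    def __init__(self, value):
        self.value = value

def names(*, name_strategy=None):
    """Hypothetical strategy function: `name_strategy` must be a strategy.

    Per the "strategies xor values" guideline, a parameter accepts either
    strategies or plain values, never both -- so bare values are rejected
    with an error that says how to wrap them.
    """
    if name_strategy is not None and not isinstance(name_strategy, SearchStrategy):
        raise TypeError(
            f"name_strategy must be a SearchStrategy, got {name_strategy!r}; "
            "wrap concrete values with st.just() or st.sampled_from()."
        )
    return name_strategy if name_strategy is not None else Just("dim_0")

print(names(name_strategy=Just("time")).value)  # prints "time"
try:
    names(name_strategy="time")  # bare value instead of a strategy
except TypeError as exc:
    print("rejected:", exc)
```

The payoff of the explicit check is that misuse fails loudly at definition time instead of producing confusing draws later.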
OK, reviewed ✅
Overall it looks pretty good, but there are a couple of places where the API you've got is pushing you into some really nasty performance traps where you have to use rejection sampling to keep things consistent. We have some tricks to help, but it's fundamentally exponential scaling - which means that it works right up until it doesn't work at all, and most of your users are likely to hit that regime 😕 Definitely possible to fix that, but it's more like a redesign than patching a typo.
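To make that scaling concrete, here is a stdlib-only toy model (not the real strategies) of the rejection-sampling trap: if each of n variables independently draws a size for a shared dimension from K choices and we reject until all draws agree, the acceptance probability is K^(1-n), so expected attempts grow like K^(n-1):

```python
import random

K = 5  # number of possible sizes for the shared dimension

def attempts_until_consistent(n_variables, rng):
    """Count rejection-sampling attempts until every variable happens to
    draw the same size for a shared dimension (toy model only)."""
    attempts = 0
    while True:
        attempts += 1
        sizes = {rng.randint(1, K) for _ in range(n_variables)}
        if len(sizes) == 1:  # all variables agree on the size: accept
            return attempts

rng = random.Random(0)
for n in (2, 3, 5):
    # average over a few runs; expected attempts are roughly K ** (n - 1)
    trials = [attempts_until_consistent(n, rng) for _ in range(20)]
    print(n, sum(trials) / len(trials))

# The fix is to draw the shared size once, up front, and build every
# variable's array to match -- one attempt regardless of n.
```

This is the "works until it doesn't" regime: a handful of variables is fine, but each extra variable multiplies the expected cost by K.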
Thanks for the feedback @Zac-HD, that's extremely helpful already! Sounds like we need to decide on the general API approach, then I can go back and fix all the internals.
(unsubbing for noise, please @-me when you'd like another review!)
@Zac-HD I think this is ready for another review (or a least a reply to the couple of unresolved comments)!
It might take me a little while to get through all this, but this is a great review, thanks again @Zac-HD !
(generally staying unsubbed; so please ping me whenever you've got questions or would like another review!)
@Zac-HD if I could request one more review please! The two remaining problems for me are:
- How should we alter the API of `datasets` to make it easier to construct `Dataset` objects containing duck-typed arrays (see https://github.com/pydata/xarray/pull/6908#discussion_r974066789)?
- Why does the example generation performance seem to have gotten worse? 😕 I added what I thought were small refactors (e.g. `_sizes_from_dim_names`) which may somehow be related, but I'm not sure.
> @Zac-HD if I could request one more review please! The two remaining problems for me are:
Absolutely! Some quick comments this evening; I would also like to do a full review again before merge but that might be next week or weekend - I'm out for a conference from early Thursday.
> - How should we alter the API of `datasets` to make it easier to construct `Dataset` objects containing duck-typed arrays (see #6908 (comment))
Replied in the thread above.
> - Why does the example generation performance seem to have gotten worse? 😕 I added what I thought were small refactors (e.g. `_sizes_from_dim_names`) which may somehow be related, but I'm not sure.
I'd be pretty surprised if that was related; `st.fixed_dictionaries()` is internally basically just that `zip()` trick anyway. I'd guess that this is mostly "as you implement the last few complicated data-gen options, they start taking nonzero time", but I'm not confident in that without reading closely and probably measuring some perf things.
How do we move this forward? Even Xarray objects with just numpy arrays would be quite useful
I think #8404 made a lot of progress on this, including shipping the user-facing documentation. If you wanted to open a PR rebasing this set of changes on main, I think that might be most of the remaining work.
So I just did a monster merge of `main` into this branch (probably should still rebase). It won't work yet because we still need to propagate all the `array_strategy_fn` stuff that went through with #8404 into the signatures of the new strategies in this PR.
> How do we move this forward?
It's mostly just dealing with the above and also making sure we can generate sets of variables with alignable dimensions efficiently. We also probably should think about what we want the signatures of the more complicated strategies to be: e.g. are we wanting to pass `variables` to `datasets`, or `array_strategy_fn` to `datasets`?
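The two candidate signatures could be sketched roughly as follows. All function and parameter names here are hypothetical stand-ins for illustration; neither matches a shipped API:

```python
from typing import Callable, Optional

def datasets_via_variables(*, variables=None):
    """Option A (hypothetical): the caller supplies a strategy for whole
    Variable objects, and `datasets` only decides which names to include
    and how their dimensions must align."""
    return ("datasets", "variables", variables)

def datasets_via_array_fn(*, array_strategy_fn: Optional[Callable] = None):
    """Option B (hypothetical): mirror the `variables` signature from
    #8404 and thread `array_strategy_fn` down to each variable's data,
    so duck-array generation is configured in one keyword."""
    return ("datasets", "array_strategy_fn", array_strategy_fn)
```

One possible trade-off: Option A gives callers full control over each variable but leaves `datasets` to re-check alignment across them, while Option B keeps the duck-array machinery in one place at the cost of less per-variable flexibility.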
> Even Xarray objects with just numpy arrays would be quite useful
A lot of the work that went into #8404 was working out how to make it general enough to handle non-numpy arrays.