xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Hypothesis strategies in xarray.testing.strategies

Open TomNicholas opened this issue 1 year ago • 15 comments

Adds a whole suite of hypothesis strategies for generating xarray objects, inspired by and separated out from the new hypothesis strategies in #4972. They are placed into the namespace xarray.testing.strategies, and publicly mentioned in the API docs, but with a big warning message. There is also a new testing page in the user guide documenting how to use these strategies.

  • [x] Closes #6911
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst

EDIT: A variables strategy and user-facing documentation were shipped in https://github.com/pydata/xarray/pull/8404

TomNicholas avatar Aug 11 '22 15:08 TomNicholas

I also added my chunking strategy from https://github.com/HypothesisWorks/hypothesis/issues/3433

TomNicholas avatar Aug 11 '22 15:08 TomNicholas

@Zac-HD would you be able to give your hypothesis-expert opinions on the general design of these strategies? Especially the whole "strategies that accept multiple other strategies" thing.

TomNicholas avatar Aug 16 '22 14:08 TomNicholas

@Zac-HD would you be able to give your hypothesis-expert opinions on the general design of these strategies? Especially the whole "strategies that accept multiple other strategies" thing.

I'll aim for a proper review tonight! Quick remarks: strategies accepting strategies is fine, though our API style guide suggests accepting strategies xor values (this is violated in a few places in our Numpy extra, but it's still a good base principle). The guides might have other useful advice, I'd recommend skimming them since you're planning to ship Hypothesis-extending code to quite a few people!

Zac-HD avatar Aug 16 '22 17:08 Zac-HD

I'll aim for a proper review tonight!

Amazing! Thank you!

Quick remarks: strategies accepting strategies is fine, though our API style guide suggests accepting strategies xor values

Oh I hadn't seen that! Looks like I independently made similar API decisions though, as far as I can tell.

TomNicholas avatar Aug 16 '22 22:08 TomNicholas

OK, reviewed ✅

Overall it looks pretty good, but there are a couple of places where the API you've got is pushing you into some really nasty performance traps where you have to use rejection sampling to keep things consistent. We have some tricks to help, but it's fundamentally exponential scaling - which means that it works right up until it doesn't work at all, and most of your users are likely to hit that regime 😕 Definitely possible to fix that, but it's more like a redesign than patching a typo.

Zac-HD avatar Aug 17 '22 06:08 Zac-HD

Thanks for the feedback @Zac-HD, that's extremely helpful already! Sounds like we need to decide on the general API approach, then I can go back and fix all the internals.

TomNicholas avatar Aug 17 '22 18:08 TomNicholas

(unsubbing for noise, please @-me when you'd like another review!)

Zac-HD avatar Aug 17 '22 19:08 Zac-HD

@Zac-HD I think this is ready for another review (or a least a reply to the couple of unresolved comments)!

TomNicholas avatar Sep 02 '22 19:09 TomNicholas

It might take me a little while to get through all this, but this is a great review, thanks again @Zac-HD !

TomNicholas avatar Sep 06 '22 15:09 TomNicholas

(generally staying unsubbed; so please ping me whenever you've got questions or would like another review!)

Zac-HD avatar Sep 06 '22 21:09 Zac-HD

@Zac-HD if I could request one more review please! The two remaining problems for me are:

  1. How should we alter the API of datasets to make it easier to construct Dataset objects containing duck-typed arrays (see https://github.com/pydata/xarray/pull/6908#discussion_r974066789)
  2. Why does the example generation performance seem to have gotten worse? :confused: I added what I thought were small refactors (e.g. _sizes_from_dim_names) which may somehow be related, but I'm not sure.

TomNicholas avatar Sep 19 '22 14:09 TomNicholas

@Zac-HD if I could request one more review please! The two remaining problems for me are:

Absolutely! Some quick comments this evening; I would also like to do a full review again before merge but that might be next week or weekend - I'm out for a conference from early Thursday.

  1. How should we alter the API of datasets to make it easier to construct Dataset objects containing duck-typed arrays (see Hypothesis strategies in xarray.testing.strategies #6908 (comment))

Replied in the thread above.

  1. Why does the example generation performance seem to have gotten worse? 😕 I added what I thought were small refactors (e.g. _sizes_from_dim_names) which may somehow be related, but I'm not sure.

I'd be pretty surprised if that was related, st.fixed_dictionaries() is internally basically just that zip() trick anyway. I'd guess that this is mostly "as you implement the last few complicated data-gen options, they start taking nonzero time", but not confident in that without reading closely and probably measuring some perf things.

Zac-HD avatar Sep 20 '22 06:09 Zac-HD

How do we move this forward? Even Xarray objects with just numpy arrays would be quite useful

dcherian avatar Apr 01 '24 02:04 dcherian

I think #8404 made a lot of progress on this, including shipping the user-facing documentation. If you wanted to open a PR rebasing this set of changes on main, I think that might be most of the remaining work.

Zac-HD avatar Apr 01 '24 04:04 Zac-HD

So I just did a monster merge of main into this branch (probably should still rebase). It won't work yet because we still need to propagate all the array_strategy_fn stuff that went through with #8404 into the signatures of the new strategies in this PR.

How do we move this forward?

It's mostly just dealing with the above and also making sure we can generate sets of variables with alignable dimensions efficiently. We also probably should think about what we want the signatures of the more complicated strategies to be: e.g. are we wanting to pass variables to datasets? or array_strategy_fn to datasets?

Even Xarray objects with just numpy arrays would be quite useful

A lot of the work that went into #8404 was working out how to make it general enough to handle non-numpy arrays.

TomNicholas avatar Apr 01 '24 14:04 TomNicholas