Comprehensive benchmarking suite

Open dcherian opened this issue 4 years ago • 6 comments

I think a good "infrastructure" target for the NASA OSS call would be to expand our benchmarking suite (https://pandas.pydata.org/speed/xarray/#/)

AFAIK running these in a useful manner on CI is still unsolved (please correct me if I'm wrong). But we can always run it on an NCAR machine using a cron job.

Thoughts?

cc @scottyhq

A quick survey of work needed (please append):

  • [ ] indexing & slicing #3382 #2799 #2227
  • [ ] DataArray construction #4744
  • [ ] attribute access #4741, #4742
  • [ ] property access #3514
  • [ ] reindexing? https://github.com/pydata/xarray/issues/1385#issuecomment-297539517
  • [x] alignment #3755, #7738
  • [ ] assignment #1771
  • [ ] coarsen
  • [x] groupby #659 #7795 #7796
  • [x] resample #4498 #7795
  • [ ] weighted #4482 #3883
  • [ ] concat #7824
  • [ ] merge
  • [ ] open_dataset, open_mfdataset #1823
  • [ ] stack / unstack
  • [ ] apply_ufunc?
  • [x] interp #4740 #7843
  • [ ] reprs #4744
  • [x] to_(dask)_dataframe #7844 #7474

Related: #3514
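
As an illustration of what filling in one of the unchecked items might involve, here is a minimal sketch of an asv-style benchmark for stack / unstack. The class name, array sizes, and dimension names are assumptions made for this example and are not taken from xarray's existing benchmark suite. It relies on two asv conventions: methods prefixed with `time_` are timed, and a `NotImplementedError` raised in `setup` marks the benchmark as skipped.

```python
# Hypothetical asv-style benchmark; class and parameter names are illustrative only.
class StackUnstack:
    """Time stack/unstack on a small 2D DataArray."""

    def setup(self):
        try:
            import numpy as np
            import xarray as xr
        except ImportError as err:
            # asv treats NotImplementedError raised in setup as "skip this benchmark".
            raise NotImplementedError("numpy and xarray are required") from err

        nx, ny = 400, 300  # assumed sizes, chosen only to keep the run short
        self.da = xr.DataArray(
            np.random.randn(nx, ny),
            dims=("x", "y"),
            coords={"x": np.arange(nx), "y": np.arange(ny)},
        )
        # Pre-compute the stacked array so time_unstack measures only unstack.
        self.stacked = self.da.stack(z=("x", "y"))

    def time_stack(self):
        self.da.stack(z=("x", "y"))

    def time_unstack(self):
        self.stacked.unstack("z")
```

Keeping array construction in `setup` means only the operation of interest is timed; a fuller benchmark would probably also parametrize over sizes and dask-backed arrays.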

dcherian avatar Dec 03 '20 18:12 dcherian

Thanks for the ping @dcherian, I really like the idea! One other thing that often gets neglected in test suites is operating on remote data. I understand the need to keep long-running tests, and tests prone to network failures, out of PR checks, but running these sorts of examples as a cron job could be very helpful for benchmarking and detecting issues.

In intake-xarray we recently added tests against a local HTTP server and "S3" server: https://github.com/intake/intake-xarray/blob/master/intake_xarray/tests/test_remote.py

We also added several simple tests that require a network connection to public data (no auth required). We currently run these locally but not in CI: https://github.com/intake/intake-xarray/blob/master/intake_xarray/tests/test_network.py
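
To make the local-server idea concrete, here is a minimal sketch that serves a file from a temporary directory over HTTP on localhost and times repeated reads, using only the Python standard library. It is not the intake-xarray test code linked above; the function name, file name, and payload are assumptions chosen for illustration, and a real benchmark would open the served file with xarray rather than reading raw bytes.

```python
import functools
import http.server
import tempfile
import threading
import time
import urllib.request
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def time_local_http_read(payload: bytes, repeats: int = 3):
    """Serve `payload` from a temporary directory and time repeated GETs."""
    with tempfile.TemporaryDirectory() as tmpdir:
        Path(tmpdir, "data.bin").write_bytes(payload)  # hypothetical file name
        handler = functools.partial(QuietHandler, directory=tmpdir)
        # Port 0 asks the OS for any free port, so parallel runs do not collide.
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            url = f"http://127.0.0.1:{server.server_address[1]}/data.bin"
            # An empty ProxyHandler bypasses any proxy configured in the environment.
            opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
            timings = []
            body = b""
            for _ in range(repeats):
                start = time.perf_counter()
                with opener.open(url, timeout=10) as response:
                    body = response.read()
                timings.append(time.perf_counter() - start)
        finally:
            server.shutdown()
            server.server_close()
            thread.join()
    return body, timings


body, timings = time_local_http_read(b"x" * 1_000_000)
print(len(body), min(timings))
```

Because the server runs on localhost, this exercises the HTTP code path without depending on an external network, which is what makes it suitable for PR checks; tests against real public data would still belong in a scheduled job.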

scottyhq avatar Dec 03 '20 18:12 scottyhq

Thanks @scottyhq

> One other thing that often gets neglected in test suites is operating on remote data.

This is lining up with the "pangeo integration tests" that came up in a Pangeo meeting (cc @rabernat).

Regardless of whether it fits there, I think adding benchmarks and tests for the xarray+zarr+fsspec stack (or xarray+open_mfdataset+netCDF) is an important and unmet need of the Pangeo community in general that we could address.

dcherian avatar Dec 04 '20 19:12 dcherian

This would be great.

Getting more concrete: I think we could potentially run this as a cron job on GitHub Actions. NCAR would also be a good option. I'm also happy to supply a VM if that's helpful.

max-sixty avatar Dec 30 '20 19:12 max-sixty

It looks like Quansight thinks that GitHub Actions is a good place to benchmark scikit-image: https://labs.quansight.org/blog/2021/08/github-actions-benchmarks/ so maybe we can set that up for our existing benchmarks.

Here's the workflow: https://github.com/jaimergp/scikit-image/blob/main/.github/workflows/benchmarks-cron.yml
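
Absolute timings on shared CI runners tend to be noisy, so one common approach is to benchmark two commits in the same job and compare the ratios; asv's `continuous` command works along these lines with a `--factor` threshold. The sketch below illustrates only that comparison step in plain Python. The function name, the benchmark names, and the timings are made up for the example, and this is not asv's implementation.

```python
def find_regressions(base, head, factor=1.5):
    """Return {name: ratio} for benchmarks where head is more than `factor` times slower.

    `base` and `head` map benchmark names to timings in seconds, both measured
    in the same job so that machine-to-machine noise largely cancels out.
    """
    regressions = {}
    for name, base_time in base.items():
        head_time = head.get(name)
        if head_time is None or base_time <= 0:
            continue  # benchmark missing on head, or an unusable baseline
        ratio = head_time / base_time
        if ratio > factor:
            regressions[name] = ratio
    return regressions


# Made-up timings, for illustration only.
base = {"time_stack": 0.010, "time_unstack": 0.020, "time_interp": 0.050}
head = {"time_stack": 0.011, "time_unstack": 0.045, "time_interp": 0.049}
print(find_regressions(base, head))
```

A larger `factor` reduces false alarms from noisy runners at the cost of missing smaller regressions.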

dcherian avatar Aug 18 '21 19:08 dcherian

@TomAugspurger are you still in charge of the pydata benchmarking machine? If so, could you please add xarray to the list (https://pandas.pydata.org/speed/)? @Illviljan has made major improvements, so it should be a lot faster now.

dcherian avatar Nov 08 '21 20:11 dcherian

"In charge of" is overstating it a bit. It's been segfaulting when building pandas and I haven't had a chance to debug it.

If and when I get around to fixing it, I'll try adding xarray, but it might be a while.

TomAugspurger avatar Nov 09 '21 12:11 TomAugspurger