Benchmarking with asv
Todo
- [x] remove example benchmarks
- [x] Add integration test benchmarks (See `asv_bench/benchmarks/benchmarks_integration.py` for an example.)
  - [x] advection 2d
  - [x] ARGO float example
  - [x] `tutorial_nemo_curvilinear.ipynb` (See `asv_bench/benchmarks/benchmarks_particle_execution.py` for an example.)
- [x] Add more detailed timing benchmarks
  - [x] time the execution of 1000 particles for 1 time step
  - [x] time the execution of 1000 particles for 100 time steps
When applicable, split into different phases. We could, e.g., go for a `.setup()` method for fieldset creation, and a `.time_execute()` method for the particle execution, as sketched below.
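As a rough illustration of that split, here is a minimal sketch of a phase-separated, parametrised asv benchmark. The class name, the parametrisation, and the tiny motionless fieldset are assumptions for illustration, not the benchmarks added in this PR:

```python
from datetime import timedelta

import numpy as np

import parcels


class ParticleExecutionSuite:
    """Fieldset creation lives in setup(); only particle execution is timed."""

    # asv calls setup() and the timed method for each value in params.
    params = [1, 100]
    param_names = ["n_timesteps"]

    def setup(self, n_timesteps):
        # Tiny, motionless idealised fieldset so that the execution machinery,
        # not I/O or real data, dominates the timing.
        dims = {"lon": np.linspace(0.0, 1.0, 16), "lat": np.linspace(0.0, 1.0, 16)}
        data = {
            "U": np.zeros((16, 16), dtype=np.float32),
            "V": np.zeros((16, 16), dtype=np.float32),
        }
        self.fieldset = parcels.FieldSet.from_data(data, dims, mesh="flat")

    def time_execute_1000_particles(self, n_timesteps):
        # Note: kernel (re)compilation in JIT mode may also be included here.
        pset = parcels.ParticleSet(
            fieldset=self.fieldset,
            pclass=parcels.JITParticle,
            lon=np.full(1000, 0.5),
            lat=np.full(1000, 0.5),
        )
        pset.execute(
            parcels.AdvectionRK4,
            runtime=n_timesteps * timedelta(minutes=5),
            dt=timedelta(minutes=5),
        )
```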
This PR introduces benchmarking infrastructure to the project via asv. Benchmarks can be run on a pull request by adding the `run-benchmarks` label to it. Two environments will be created in the CI runner with the prior and proposed changes, and both suites of benchmarks will be run and compared against each other.
Note that this PR only has example benchmarks for the time being, until we can discuss benchmarks of interest.
Running the benchmarks in CI is only one aspect of the benchmarking (i.e., only for core parcels functionality). Using asv, we can create different suites of benchmarks (e.g., one for CI and one for heavier simulations). The benefit of using asv is everything else that comes out of the box with it, including:
- running benchmarks easily across several commits and visualising them in a dashboard
- easily creating and managing benchmarking environments
- profiling support to dive into the locations of slowdowns
- community support
Changes:
- asv configuration (conf file, benchmarks folder, and CI workflow)
- asv documentation (available via the community page in the maintainer section)
I have done some testing of the PR label workflow in https://github.com/VeckoTheGecko/parcels/pull/10. We can only test this for PRs in OceanParcels/parcels once it's in master.
Related to #1712
@erikvansebille On the topic of performance, are you also experiencing it taking something like 10s occasionally to run `import parcels`?
> @erikvansebille On the topic of performance, are you also experiencing it taking something like 10s occasionally to run `import parcels`?
Yes, I also experience this slow import sometimes. Not sure why...
Setting to draft until we have some actual benchmarks that we can include in this.
From meeting:
We can use `tutorial_nemo_curvilinear.ipynb` as well
The Argo tutorial at https://docs.oceanparcels.org/en/latest/examples/tutorial_Argofloats.html is also quite a nice simulation for benchmarking, as it has a 'complex' kernel. It took approximately 20s to run on v3.1 in JIT mode, and now takes 50s on my local computer to run in Scipy mode.
@danliba or @willirath, could you add the Argo tutorial to the benchmark stack?
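A rough sketch of how it could be slotted into the integration benchmarks, assuming the tutorial's simulation is first factored into an importable function; the module path, function name, and timeout value below are assumptions:

```python
# Hypothetical addition to asv_bench/benchmarks/benchmarks_integration.py.


class ArgoFloatSuite:
    # The full tutorial run takes tens of seconds, so raise asv's default
    # 60 s benchmark timeout to leave some headroom.
    timeout = 300

    def time_argo_floats(self):
        # Hypothetical helper wrapping the tutorial_Argofloats simulation.
        from parcels_benchmarks.argo import run_argo_example

        run_argo_example()
```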
Can we also use some of #1963?
If you wish to use parts of #1963 and don't already have a Copernicus ocean account with the right access, email me and I can set up a quick way for you to download the circulation files
So I've spent some time with this over the weekend. Here are a few insights:

- ASV implicitly assumes that benchmarks don't change. This is most obvious from the fact that `asv run` will always discover benchmarks from `$PWD/benchmarks/` with whatever is present in this directory at the time of invoking ASV. Hence, adapting semantically identical benchmarks to a changing API needs work in the `.setup()` method of benchmark suites (see the sketch after this list).
- ASV's shinier features (accumulation and publication of benchmark results over long times and across multiple versions) have never really been adopted widely. All the scipy and pydata projects I've checked (dask, numpy, asv-runner) have ASV benchmarks defined, but their auto-generated and published reports have been outdated for years. Also, none of the example benchmark results given in the ASV docs have seen an update for at least 4 years (see their astropy example, their numpy example, the other numpy example and their scipy example).
- ASV has a way of compiling relative results of benchmarks for different commits taken in one run using `asv continuous`. This is what's done in the GitHub workflow from an earlier commit and is used in xarray for detecting performance degradations introduced by PRs. That workflow is, however, rarely used and was only requested in 93 PRs out of more than 2000 PRs in the same time range.
- The env management in ASV appears to be in a phase of restructuring, and in particular the (faster) `mamba`-based envs only work when enforcing pretty old versions of Mamba. It looks as if the ASV devs have given up on Mamba in favour of moving towards `rattler` support. This, however, is also nowhere near fully implemented. As a result, the only env management that worked smoothly with Parcels' compiler dependencies and did not result in thousands of lines of warnings and errors was `"conda"`.
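To make the first point concrete, here is a minimal sketch of keeping a benchmark's timed body identical across commits while absorbing API differences in `.setup()`. The fallback attribute lookup is purely illustrative, not an actual parcels rename:

```python
from datetime import timedelta

import numpy as np

import parcels


class AdvectionSuite:
    def setup(self):
        # asv always collects benchmarks from the working tree, so anything
        # that has to differ between the benchmarked commits belongs here.
        dims = {"lon": np.linspace(0.0, 1.0, 8), "lat": np.linspace(0.0, 1.0, 8)}
        data = {"U": np.zeros((8, 8)), "V": np.zeros((8, 8))}
        self.fieldset = parcels.FieldSet.from_data(data, dims, mesh="flat")
        # Illustrative shim for a hypothetical rename between versions:
        self.pclass = getattr(parcels, "JITParticle", None) or parcels.ScipyParticle

    def time_advect(self):
        # The timed body stays semantically identical across commits.
        pset = parcels.ParticleSet(
            fieldset=self.fieldset, pclass=self.pclass, lon=[0.5], lat=[0.5]
        )
        pset.execute(
            parcels.AdvectionRK4, runtime=timedelta(hours=1), dt=timedelta(minutes=5)
        )
```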
My recommendation is to forget about anything along the lines of "automation", "continuous", etc.
ASV provides a great way of defining benchmarks, running them against a bunch of similar well-behaved (i.e. no API changes or dependency changes) commits and then comparing the results with a rather narrow scope along the time / version axis. We should focus on this core functionality and leave the step of running and interpreting the benchmarks to the individual dev for now.
> My recommendation is to forget about anything along the lines of "automation", "continuous", etc.
> ASV provides a great way of defining benchmarks, running them against a bunch of similar well-behaved (i.e. no API changes or dependency changes) commits and then comparing the results with a rather narrow scope along the time / version axis. We should focus on this core functionality and leave the step of running and interpreting the benchmarks to the individual dev for now.
Yes, I think this sounds good. I was talking with an xarray maintainer, and he mentioned that ASV isn't really used much in CI - the intermittency of GitHub runners adds quite some noise, which means that unless a regression degrades performance by a factor, it won't be obvious in the output.
> ASV provides a great way of defining benchmarks, running them against a bunch of similar well-behaved (i.e. no API changes or dependency changes) commits and then comparing the results with a rather narrow scope along the time / version axis.
I think so as well. Being able to define benchmarks for v3, port them to v4, and then work with them in v4 will give devs a targeted lens on performance (even if it's just local - which is perfectly fine).
Another note: as part of this PR, can we remove the dependence on the `parcels/tools/timer.py::Timer` class (which is currently used in a couple of examples for simple benchmarking)?
Ideally, once this is merged, we can completely remove the `Timer` class from the codebase.