pyhf
research: time metrics with honeycomb
Description
See the python SDK: https://github.com/honeycombio/libhoney-py
Workflow I had in mind
- master: baseline time that we always know works (and we can do nightly metrics to be sure)
- PR opens: if it makes one of our metrics slower, trigger some alert or comment on what's slower, link to honeycomb w/ details or w/e
In general, we won't merge in PRs unless we can fix the slow stuff.
@ismith:
> Caveat: Honeycomb is ideally meant for a lookback window of no more than two weeks; you can set a query to look back up to two months.
>
> In events, you can specify fields. You get `duration_ms` for ~free; you also might get the function name for free. But you'll want, in the config, to add maybe the branch name and PR [id]. Then you can do a query that creates a graph of master vs. non-master, and define thresholds for yay/nay.
>
> We do not have an automated GitHub check, so no automated enforcement, but we do offer Slack/email/PagerDuty and webhooks if you want something custom. (If you blog this when you're done we'll give you stickers and maybe a t-shirt.) In a Coveralls world this might be configurable as "PR is red, may not merge", same as if you failed CI. We don't offer that out of the box, and I don't know that you want that, but setting it up to comment on the PR is not hard to build with a webhook.
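To make those fields concrete, here is a rough, untested sketch of what sending a single timing event with libhoney-py could look like. The dataset name, the `HONEYCOMB_WRITEKEY` and `PR_NUMBER` environment variables, and the choice of timed pyhf call are all placeholders for illustration, not anything that's set up yet.

```python
# Sketch only: report one timing event to Honeycomb via libhoney-py.
# Dataset name, env var names, and the benchmarked call are placeholders.
import os
import time

import libhoney
import pyhf

libhoney.init(
    writekey=os.environ["HONEYCOMB_WRITEKEY"],  # hypothetical CI secret
    dataset="pyhf-benchmarks",  # hypothetical dataset name
)
# Global fields attached to every event, so queries can split master vs. non-master
libhoney.add_field("branch", os.environ.get("GITHUB_REF_NAME", "unknown"))
libhoney.add_field("pr_number", os.environ.get("PR_NUMBER", ""))

# Time one representative operation (here a simple maximum likelihood fit)
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [51, 48] + model.config.auxdata

start = time.perf_counter()
pyhf.infer.mle.fit(data, model)
duration_ms = (time.perf_counter() - start) * 1000

event = libhoney.new_event()
event.add_field("function", "pyhf.infer.mle.fit")
event.add_field("duration_ms", duration_ms)
event.send()

libhoney.close()  # flush queued events before the CI job exits
```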
Probably also worth looking at airspeed velocity as this seems to be basically exactly what I had in mind.
This might be worth looking into if we can get an external grant to pay for us to run a small Digital Ocean or AWS instance to host this. Seems pretty valuable.
NumPy and SciPy use asv for benchmarks, so it might be worth looking at how they do it.
- Benchmarking NumPy with Airspeed Velocity (this is probably the most helpful)
- GHA that runs asv for NumPy
- Relevant section of test script (maybe not the best to try and generalize from)
An interesting thing is that asv will go and run tests on old commits automatically so you can automatically build the performance history.
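For a sense of what that could look like for us, a first `benchmarks/benchmarks.py` might be something like the sketch below, using asv's convention that methods named `time_*` are timed and `setup()` runs beforehand. The workloads are just illustrative guesses, not a settled benchmark suite.

```python
# Sketch of a possible benchmarks/benchmarks.py for pyhf.
# asv times any method named time_* and calls setup() before each benchmark.
import pyhf


class TimeInference:
    def setup(self):
        # Build a small model once so construction isn't part of the timed region
        self.model = pyhf.simplemodels.uncorrelated_background(
            signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
        )
        self.data = [51, 48] + self.model.config.auxdata

    def time_mle_fit(self):
        pyhf.infer.mle.fit(self.data, self.model)

    def time_hypotest(self):
        pyhf.infer.hypotest(1.0, self.data, self.model, test_stat="qtilde")
```

asv would pick this module up via `asv.conf.json`, and running `asv run` over a range of commits is what builds up the performance history mentioned above.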
I think(?) this might be possible to do with just a repo over in the pyhf org that runs things on a cron job.
cf. also Is GitHub Actions suitable for running benchmarks?, where the answer is: yes.
And https://github.com/pydata/xarray/pull/5796 provides basically a template for how to do all of this!
In https://github.com/glotzerlab/signac/pull/776 @bdice mentions
> We deleted the CI script for benchmarks from signac 2.0 anyway, because it's not reliable and we want to use asv instead.
@bdice I would love to talk to you about asv sometime as we've been wanting to set that up for pyhf for a while but haven't yet. If you have insights on how to get going with it I'd be quite keen to learn.
You can see signac's benchmarks defined here: https://github.com/glotzerlab/signac/blob/master/benchmarks/benchmarks.py
And the asv config: https://github.com/glotzerlab/signac/blob/master/asv.conf.json
And here's a quick reference I wrote on how to use asv: https://docs.signac.io/projects/core/en/latest/support.html#benchmarking
I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow. Sometimes I wish for features that don't exist (like being able to have greater control over test setup/teardown to ensure that caches are cleared between runs without having to regenerate input data -- something like pytest fixtures would be helpful). I've run into a handful of situations while running asv that felt like bugs but were difficult to trace down. I don't know of better alternatives to asv unless you have the time and energy to roll your own Python scripts, which is what signac had done for a long time. Eventually the maintenance of those DIY scripts and their limitations were annoying enough that outsourcing to asv felt like a good decision.
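For context, what asv does give you is per-benchmark `setup()`/`teardown()` methods plus `setup_cache()` for expensive one-time data generation, which is roughly the level of control I'm describing. Here's a minimal sketch with a made-up I/O-heavy workload (not signac's actual suite):

```python
# Sketch of asv's setup/teardown hooks with an invented I/O-heavy workload.
import shutil
import tempfile


class TimeIOHeavy:
    def setup(self):
        # setup() runs before the benchmark is timed: the natural place to
        # (re)generate input data and try to clear caches, which is exactly
        # the coupling that makes finer-grained control hard
        self.workdir = tempfile.mkdtemp()
        self.records = [f"{i}\n" for i in range(10_000)]

    def time_write_records(self):
        with open(f"{self.workdir}/out.txt", "w") as handle:
            handle.writelines(self.records)

    def teardown(self):
        # teardown() runs afterwards; clean up the scratch directory
        shutil.rmtree(self.workdir)
```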
edit: I read some of the thread above. I have had really mediocre experiences with running benchmarks as a part of CI or on shared servers. Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O. The results from Quansight on GitHub Actions were extremely helpful for calibrating my own experience of annoyance with CI benchmarks in the past. I don't think the metrics they see for false positives and highly noisy data are good enough for what the signac project has needed in the past -- local benchmarks are much less variable in my experience.
Hi folks, @matthewfeickert asked me to leave my 2 cents here a few days ago. Basically 2 things:
> Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O.
This is 100% correct. Here are the benchmarks we ran a few years ago in poliastro: the noisy lines are from my own laptop (supposedly without doing anything else); the almost straight line is from a cheap, dedicated server we rented on https://www.kimsufi.com/. Slower, but infinitely more useful.

> I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow.
Recently they got a grant https://pandas.pydata.org/community/blog/asv-pandas-grant.html and managed to revamp the CI and make a release. The project has not seen more commits since then, so I agree it's not very active, but I'm not aware of any alternatives. The closest one would be https://github.com/ionelmc/pytest-benchmark/, but it's equally inactive.
Following up on @astrojuanlu's excellent points, I was talking with @gordonwatts at the 2022 IRIS-HEP Institute Retreat about this and he mentioned that he might have some dedicated AWS machines that we could potentially use (or at least trial a demo). Gordon, if you can elaborate on this as my memory from last week isn't as clear as it was the next day.
We have an account that is connected with IRIS-HEP for benchmarking (@masonproffitt and I were going to use it for some benchmarking for our ADL Benchmark paper work, but that didn't happen). It is still active, and only Mason and I have access. You get a dedicated machine of a specific size (at least, that is what the web interface says), so if one can build a script that does the complete install and then runs the tests, this could be a cheap-ish way to run these.