zsh-bench
feat: add mean and stdev to stats
Summary: Currently, zsh-bench reports only the minimum value of each metric. This commit adds two further statistics: the mean and the standard deviation.
Test plan:
Before:
$ ./zsh-bench
==> benchmarking login shell of user max ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=0
has_git_prompt=1
first_prompt_lag_ms=22.610
first_command_lag_ms=140.186
command_lag_ms=7.516
input_lag_ms=4.275
exit_time_ms=98.705
After:
$ ./zsh-bench
==> benchmarking login shell of user max ...
creates_tty=0
has_compsys=1
has_syntax_highlighting=1
has_autosuggestions=0
has_git_prompt=1
first_prompt_lag_ms=22.610±0.67
first_command_lag_ms=140.186±3.61
command_lag_ms=7.516±0.52
input_lag_ms=4.275±0.25
exit_time_ms=98.705±3.48
What's the motivation for the PR?
I saw a video by Sabine Hossenfelder who said, "Data without error bars isn't science." Ever since then I've had this nagging voice in my head whenever I see data without error bars.
When I used this tool to measure my own zsh config I saw my command lag was "7.5ms", but I wasn't sure if it was exactly 7.5ms every time or if there was some variance. I added the standard deviation calculation to check the variance of the command lag.
We aren't doing science here. The purpose of zsh-bench is to optimize code. Neither mean nor stddev help with this.
Why wouldn't knowing whether or not a metric has a high variance help with optimizing code?
> The purpose of zsh-bench is to optimize code.
If the measurement before a change is min=39, and the measurement after a change is min=37, did the change improve performance, make it worse, or have no effect?
If the tool then told you ±6, would that change your conclusion?
The noise comes from all sorts of sources that are unrelated to the code we care about: from other applications, from CPU frequency scaling, etc. We don't want noise, so we want to somehow get rid of it. One option is to compute mean, the other is to compute min. The latter converges a whole lot faster, meaning that we can run the benchmark for less time.
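The convergence claim is easy to check with a toy simulation (nothing below is zsh-bench code; the latency model and numbers are invented). Model each measurement as a fixed true cost plus nonnegative environmental noise, and compare how much the reported min and the reported mean wander from run to run:

```python
# Toy simulation: how fast do min and mean converge across repeated
# benchmark runs? The exponential noise model is an assumption.
import random
import statistics

random.seed(0)

def noisy_latency_ms():
    # True cost of 37 ms plus nonnegative noise from the environment.
    return 37.0 + random.expovariate(1 / 6.0)

def spread_of_estimator(estimator, n_iterations, n_runs=200):
    # Run the "benchmark" n_runs times; report how much the estimator
    # varies between runs (max minus min of the reported values).
    results = [estimator([noisy_latency_ms() for _ in range(n_iterations)])
               for _ in range(n_runs)]
    return max(results) - min(results)

for n in (8, 32, 128):
    print(n,
          "min spread:", round(spread_of_estimator(min, n), 2),
          "mean spread:", round(spread_of_estimator(statistics.mean, n), 2))
```

With one-sided noise like this, the min pins itself near the true cost after a handful of iterations, while the mean keeps fluctuating with the noise.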
> Why wouldn't knowing whether or not a metric has a high variance help with optimizing code?
You are sending a patch to my code. If you cannot convince me why it's a good change, I won't merge it. This of course should not stop you from using your code yourself. Likewise, if I send you a patch, you won't merge it unless you believe it makes your code better.
@strager The answer to your first question is "it depends". The second is an easy "no": stddev is irrelevant when looking at the min. You should see it if you imagine looking at many samples of a random variable and picking the minimum. With the deviation of ±6, it won't take long for the min to converge.
> We don't want noise, so we want to somehow get rid of it. One option is to compute mean, the other is to compute min. The latter converges a whole lot faster, meaning that we can run the benchmark for less time.
Point taken for mean. I agree min could be a better metric. However, I still think stddev is helpful. The stddev can help identify the inherent "noise" that you mention. The stddev can help point users to what their real lag will be instead of the best case. As strager pointed out, it can also help identify whether changes to your config affect performance.
I agree with "We don't want noise, so we want to somehow get rid of it". If I run the zsh-bench in master rn twice without changing any configs and set the iterations to 32, I'll still get slightly different results. Therefore, how can I know if changes to my config actually make a difference? The stddev is one tool I can use to inform my decision. Ideally I'd want some sort of p-value too, but as you said, "We aren't doing science here." 😛
> If I run the zsh-bench in master rn twice without changing any configs and set the iterations to 32, I'll still get slightly different results. Therefore, how can I know if changes to my config actually make a difference? The stddev is one tool I can use to inform my decision.
Error bars for the min would be useful but stddev does not qualify. If you increase the number of iterations, the reported min from several runs will have smaller variation but the reported stddev won't change.
I presume that we all agree that

> The purpose of zsh-bench is to optimize code.

Though apparently the opinions on what the optimization target is vary:
I got the impression that @romkatv so far intended to reduce the minimal execution time as a measure of the factual overhead that a shell extension such as z4h introduces. Others like @vegerot want to utilize zsh-bench to empirically inform their decision on whether (and in which way) a certain change in their shell configuration has impact on their future shell experience.
As @romkatv mentioned himself:

> We don't want noise, so we want to somehow get rid of it.
I agree that nondeterministic behaviour is unfavorable in most circumstances. I can't agree, though, that one actually gets rid of the noise by "comput[ing the] mean, [or the] min". It's just a way of not reporting the observed noise by choosing a specific metric. Some care about the actual behaviour of their shell, and noise is part of it.
I understand where @romkatv is coming from, as they mention their assumption that

> the noise comes from all sorts of sources that are unrelated to the code we care about

I have to disagree with this assumption, though: changing the shell configuration won't necessarily only impact the minimal overhead (for which the min metric is totally adequate and sufficient) but can also contribute to the variance of the actual overhead.
Example: Say one changes the shell config and thus makes the shell depend on some data that most of the time is fetched quite fast but sometimes with high latency, then this might change one's decision to opt-in for this change. IMHO it's not trivial or obvious that z4h runs perfectly deterministic (actually it does not, hence the min metric), so the question is what should be optimized for?
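The scenario described above can be mocked up (both configs and all latency numbers are invented for illustration): two configs with an identical best case look the same under min, but a high percentile exposes the occasional slow fetch.

```python
# Illustrative only: a "fast" config and a "flaky" config that share
# the same best-case timing but differ in the tail.
import random

random.seed(2)

def fast_config():
    return 37.0 + random.uniform(0, 1)

def flaky_config():
    # 5% of the time a dependency is fetched with high latency.
    base = 37.0 + random.uniform(0, 1)
    return base + (200.0 if random.random() < 0.05 else 0.0)

a = sorted(fast_config() for _ in range(10_000))
b = sorted(flaky_config() for _ in range(10_000))

for name, xs in (("fast", a), ("flaky", b)):
    p99 = xs[int(0.99 * len(xs))]
    print(name, "min:", round(xs[0], 2), "p99:", round(p99, 2))
```

Both mins come out near 37 ms, so a min-only report cannot distinguish the two configs even though their day-to-day feel would differ.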
Many software projects relied on timing `sh -lic "exit"` before zsh-bench was introduced and, as it turns out, this metric has not actually been a useful proxy for optimizing what users value.
I can't help but see the parallel in the way zsh-bench currently measures the min metric while users don't care only about the optimal scenario, but want to know what to actually expect in total.
It's a bit like a company selling some ultra fast storage medium with "up to 10GB/s". Technical users often want to know how the storage performs across the spectrum and not under ideal and lab controlled circumstances.
So while not arguing over whether the scientific method is relevant here or whether error bars are required for it, I must say that zsh-bench should somehow report what it actually measures, e.g. "These are the minimal timings observed when running the measurement X times."
I myself ran zsh-bench multiple times to observe the variance in the reported timings. Also, until I viewed this pull request, I thought those timings reported the average timing, not the min.
Maybe a good solution would be to provide an option to either
- report/log all measured values, or
- report the min, the median, and the max.
Either way, I think zsh-bench should inform the user, alongside the data, what the meaning of the data is (no pun intended).
zsh-bench is a tool that I use to optimize my code. This code's performance variance comes solely from the outside. Hence min gets rid of unwanted noise.
As mentioned in the wall of text before, I have to disagree with the assumption that "variance comes solely from the outside":
Changing the shell configuration won't necessarily only impact the minimal overhead but can introduce variance itself.
I might be wrong. Can you show the part of my code that has intrinsic performance variance?
I subscribe to the Unix philosophy of "doing one thing and doing it well".
Neither the calculation of distribution metrics nor optimizing code is the primary purpose of zsh-bench (the latter being the job of the programmer). The goal of zsh-bench AFAIK is to offer a good benchmark. Whether offering a min metric as a default is a good UX design choice is a different topic.
Since zsh-bench offers the --raw flag, the data can be piped into any program (e.g. zsh-bench-hist) for further processing or visualization.
@vegerot I hope the tool suits your needs just as it suits mine.
I want to share some real data (although sadly unrelated to zsh4humans) with you:
When benchmarking unrelated projects, it's noteworthy that optimizing for reduced mean time, reduced variance of the mean time, and achieving the global minimum (among larger timings) can lead to different solutions being selected.
Consider the following results from 100,000 samples each:
| Method | Mean Time | Standard Deviation Time | Minimum Time |
|---|---|---|---|
| Using np.tile | 40.181 µs | 1110.338 µs | 8.821 µs |
| Using np.full | 30.678 µs | 1058.533 µs | 6.199 µs |
| Using np.broadcast_to | 42.571 µs | 1341.520 µs | 5.960 µs |
| Using list and np.array | 396.100 µs | 3509.649 µs | 95.367 µs |
@romkatv, I appreciate your inclusion of --raw, recognizing that there isn't typically a singular "correct" solution.
@vegerot, regarding this PR: while displaying the standard error of the chosen metric could enhance comprehension, revealing the mean and standard deviation of the timings may not be as beneficial. This is because the distribution is not necessarily Gaussian, and analyzing the spread of the samples won't offer definitive insight into the standard error of the mean.
To present data with "error bars," one would need to execute the benchmark multiple times (or segment it), aggregating samples to the metric (such as mean/min, etc.) for each run, and then report the variance of the metric using an appropriate estimator. However, this approach differs from providing the standard deviation of the samples themselves.
For those seeking a general sense of the distribution, it might be more informative to report the minimum, maximum, as well as some intermediary percentiles like the 5th, 50th (median), and 95th percentile.
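A report along those lines could be sketched as follows (synthetic samples and illustrative field names; Python's stdlib `statistics.quantiles` stands in for whatever the harness would use):

```python
# Sketch of a min/percentile/max summary over raw latency samples.
# The exponential noise model and the field names are assumptions.
import random
import statistics

random.seed(3)
samples_ms = [37.0 + random.expovariate(1 / 6.0) for _ in range(10_000)]

# n=20 yields 19 cut points at the 5%, 10%, ..., 95% levels.
q = statistics.quantiles(samples_ms, n=20)
summary = {
    "min": min(samples_ms),
    "p5": q[0],
    "p50": q[9],   # median
    "p95": q[18],
    "max": max(samples_ms),
}
print({k: round(v, 2) for k, v in summary.items()})
```

Unlike a lone min or a mean±stddev pair, this shape makes no distributional assumptions, which matters since latency distributions are typically skewed rather than Gaussian.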