BenchmarkTools.jl icon indicating copy to clipboard operation
BenchmarkTools.jl copied to clipboard

Re-organise `@benchmark` display

Open mcabbott opened this issue 2 years ago • 33 comments

This proposes to re-organise show for @benchmark. The idea is:

  • Put the most important information first -- closes #236
  • Put all the numbers above the histogram
  • Make it easy to find the times, even when copied without colour -- they are (almost always) in the same left-to-right order as the histogram
  • Don't print empty numbers -- 0 allocations means no GC numbers; and in general fewer GC percentages
  • Print the maximum GC time, not the GC time of the longest run -- #248
  • Print the 99th percentile instead of the maximum time. This was already used as the right limit of the histogram.

The histogram is tweaked as follows:

  • The mean and median have markers, not to rely on colours. They are still green + blue
  • Bars with zero counts are drawn in gray
  • The limits are rounded to 2 significant figures
  • Using a log scale for counts is now tuned off by default -- #237 . (Maybe this is orthognal and should be a separate PR.)
  • A label "log(counts) by time" is printed only when the graph isn't an ordinary histogram. This is controlled by IOContext(stdout, :logbins => true) (as before)
  • The marker under the "overflow" bin is + not <.

To test on an example I didn't invent, here is the case from https://github.com/JuliaCI/BenchmarkTools.jl/issues/249 / this post. Big screenshots but IMO it's useful to look at a few of them together, since surely you are comparing things:

Screenshot 2021-09-21 at 10 57 46

Notice that in the 3rd case above, the mean and median co-incide, so the median marker is moved one to the left. A small lie, but omitting it seems like a puzzle. Note also that no GC times are printed because they are all zero.

Plain text:

julia> @benchmark for i in $w
           $a[i] = $b[i]
       end
BenchmarkTools.Trial:
│  min 2.625 μs, median 2.838 μs (½), mean 2.838 μs (*), 99ᵗʰ 3.231 μs
│  0 allocations
│  10_000 samples, each 9 evaluations:
│                          ½*                                                 
│       ▇                           █▁    ▁                                   
│  ▁▁▂▂▇█▆▄▂▂▂▃▅▃▂▂▃▆▆▃▂▃▇█▄▂▂▄█▆▃▂▃██▃▂▃▇█▅▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▂▁▂▂▂▁▁▁▁▁▁▁ ▃
└  2.6 μs                                                             3.3 μs +

Before (not exactly the same data). Notice the switch to log scale in the 2nd case. With the same size window, fits 4 less lines here.

Screenshot 2021-09-21 at 10 58 09

After, on a dark theme (not the exact same data):

Screenshot 2021-09-21 at 11 02 46

and on iTerm's version of Solarized Dark, with no minimum contrast:

Screenshot 2021-09-21 at 11 03 38

As we know from stack traces, the :light_black has been set equal to the background. Here that doesn't seem like a huge problem.

But also, this them sets :light_blue the same as the foreground, which is weird. I made the median bar that colour since I thought :blue was nearly invisible in some dark themes. But perhaps it should change back.

mcabbott avatar Sep 21 '21 15:09 mcabbott

One further step might be to put the number of samples etc. under the histogram -- it still refers directly to the data, but is out of the way. It's not really a property of what you are testing:

Screenshot 2021-09-21 at 12 32 57

mcabbott avatar Sep 22 '21 13:09 mcabbott

Glancing at the code, I feel like there may be a few too many commented out lines/blocks, however, nothing else jumps out at me :slightly_smiling_face:. I do have more feedback on the style though. I'll just go through the dot points you've raised here and then bring up one or two things from your last commit in #236.

The mean and median have markers, not to rely on colours. They are still green + blue

I've said it before, but I really like this. I do wonder if median(½) 4.607 μs may be worth considering over median 4.607 μs (½), but it's much of a muchness.

Bars with zero counts are drawn in gray

I'm not sure that this is a good idea. In asciihist I already implemented the condition that the minimum height should only be used for zero entries, and IMO making it gray too is excessive.

The limits are rounded to 2 significant figures

I think some rounding is a very good idea. On that note, I wonder of important/useful the 28 is in 10.328? I know I'd be happy without it. Using five significant figures implicitly assumes that the error term is of the order 10^-6 (relative to the result). Given the noise inherent to most systems, I think this is a bit of a stretch, and so would be in favour of going down to 3 significant figures with the timings.

Using a log scale for counts is now tuned off by default

This has been discussed before, and is sensible.

The marker under the "overflow" bin is + not <.

Hmmm, can't say I'm a fan of this. I see from the original issue you say you don't like < because it means "less than" and it's used to represent the overflow. However, I would contend that that's not quite correct, < means that the terms on the left are less than the terms on the right, and that accurately represents what's occurring here. Simply put, a < b means that b is bigger than a, and so to me this is a natural way of indicating that the bar represents times greater than the 99% value on the left.

Now, a few things from your comments in the issue.

breathing space for the 1st line

To me, I find the jagged line length more offputting than the lack of breathing space. I am still in favour of reordering the lines as previously mentioned.

[moving the number of samples down] ... to put the most important information first. If you want to copy the first few lines.

We've already got the least informative bit of information on the first line BenchmarkTools.Trial, and so to me moving the sample information down and onto a dedicated line actually prioritises what is IMO the least interesting bit of information. I'm still in favour of leaving this next to BenchmarkTools.Trial.

The histogram is set to fit in 80 characters for now. I wondered about making it auto-size to match the first line, but then lines with times XXX.YYY will make it longer, so different ones won't line up.

The difference in length with the numbers is only around 2-3 characters, so you can still make the length somewhat closer. Furthermore, as you're already constructing strings from the values you can just take the length of those strings. That is in fact what currently occurs with histwidth = 42 + lmaxtimewidth + rmaxtimewidth.

tecosaur avatar Sep 22 '21 14:09 tecosaur

The code is a mess BTW, sorry. Lots to tidy up.

median 4.607 μs (½)

I don't love this, maybe there's a better way. I hoped there would be a self-evident symbol but the best I could think of is ㊿, and that seems not to print as a single-width character. (For the mean, μ or might be options.)

I wonder of important/useful the 28 is in 10.328 μs

I'm pretty sure this is garbage. I presume the reason to print 3 figures after the . always is that these are the ns, it lines up with the steps of units, which sometimes makes comparisons easier?

not quite correct, < means that the terms on the left are less than

Yes, I can see the logic of how you got there. But it jars every time for me; the larger values are above, not to the right. And even in text 100 < x is always a little surprising. For me > would look better but I think you would hate it. And the fact that it can be seen both ways is, IMO, an argument for finding another way entirely.

If not +, then maybe some kind of ...? An arrow? Something referencing this being the top 1%... like 💯?

In traditional graphs, this is often marked with a symbol for breaking the axis. That might be another way, rather than labelling the last bar. ?

In asciihist I already implemented the condition that the minimum height should only be used for zero entries

This is good, and I agree that coding zero only by the colour would be too subtle. But until I read the source I wasn't sure that the smallest bar meant exactly zero, not just rounded down below the next-biggest bar. My hope is that this will be easier to infer with the difference of shade. (Especially with the log scale, although less of a concern if that's harder to run into.)

Maybe you considered this, but printing on the next line would be another option -- then there could be a continuous axis, in a different colour, with zero indicated by nothing at all above it. But then you can't also print text on the line below.

The difference in length with the numbers is only around 2-3 characters, so you can still make the length somewhat closer

Sure. It would be easy to make it fit the first line exactly. The reason I showed 3 together above, though, is that I think comparing many is common, and that it looks strange to me to have the graphs change size as you go down the page. It could be fixed and also match the first line, if the times were printed (say) always with 4 numbers, 9.328 μs and 10.32 μs and 410.3 μs.

I guess all the text is ragged-right here, nothing else is justified, and in this context a fixed length looks OK to me. And once we spend the lines on the graph, we may as well not squeeze it too much horizontally.

One reason to like ragged-right is that I think it's less damaged by word wrapping. In 50 columns:

Screenshot 2021-09-22 at 11 34 28

Edit: At the width shown in https://github.com/JuliaCI/BenchmarkTools.jl/pull/217#issuecomment-925164738, trying @benchmark copy!(Y, transpose(X)) shrinks from 34x23 to 34x19. It could shrink by 4 more lines if we avoid printing trailing spaces after all the histogram data.

mcabbott avatar Sep 22 '21 15:09 mcabbott

median 4.607 μs (½)

I was talking about the positioning of the (½), not the symbol used.

I presume the reason to print 3 figures after the . always is that these are the ns, it lines up with the steps of units, which sometimes makes comparisons easier?

This only makes sense when you arrange the values in a column and want to line up the decimal places. This already isn't happening, so why not just go with three significant figures?

<

I think we've both said all that's worth saying on this.

shading the zero bars

Hmm, I still don't see on the next line or shading it grey as good options. Particularly given we know that grey disappears for some people which makes it look ... weird.

bar length

If you do significant figures instead of decimal places, then the length of the longest line will be much more consistent, and I think you could just set a close length here and be no more than a char or two off.

text arrangement

I'm not referring the the ragged-right alightment. That's never been in question. I do content though that visually:

BenchmarkTools.Trial:
min 9.541 us, median 13.958 us (1/2), mean 16.363 us (*), 99th 19.917 us
4 allocations, 78.09 KiB
GC time: mean 2.277 us (13.92%), max 1.698 ms (98.36%)
10_000 samples, each 1 evaluation:

is not as nice as,

BenchmarkTools.Trial: 10_000 samples, each 1 evaluation
min 9.541 us, median 13.958 us (1/2), mean 16.363 us (*), 99th 19.917 us
GC time: mean 2.277 us (13.92%), max 1.698 ms (98.36%)
4 allocations, 78.09 KiB

For a few reasons:

  • the line length isn't jumping around as much
  • I like the layout that time is at the top and layout is at the bottom, instead of sandwiching the memory result in between two time results
    • also in terms of visual priority, it goes: first line, last line, second line and that matches the priority I'd give to time, allocs, GC
  • taking up less lines is good, and I don't see much of a downside to shelving the samples next to the trial. With the greying of BenchmarkTools.Trial it looks like more of a preamble than the first line anyway, which IMO gives it about the right degree of visual emphasis

tecosaur avatar Sep 22 '21 15:09 tecosaur

I was talking about the positioning of the (½), not the symbol used.

Sure. I meant that needing to provide a key at all seems a bit awkward, I think we agree the first line would look better without. We can try before/after etc, but I meant first to wonder whether there might be bigger changes to avoid that entirely.

Maybe if it marked all the quartiles, then 1/4 1/2 3/4 would be obvious without a key? Something like this (drawing, not data!)

julia> @benchmark for i in $w
           $a[i] = $b[i]
       end
BenchmarkTools.Trial:
│  min 2.625 μs, median 2.838 μs, mean 2.838 μs, 99ᵗʰ 3.231 μs
│  0 allocations
│             ¼            ½*                              ¾                 
│       ▇                           █▁    ▁                                   
│  ▁▁▂▂▇█▆▄▂▂▂▃▅▃▂▂▃▆▆▃▂▃▇█▄▂▂▄█▆▃▂▃██▃▂▃▇█▅▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▂▁▂▂▂▁▁▁▁▁▁▁ ▃
└  2.6 μs            10_000 samples, each 9 evaluations               3.3 μs ⋯

Symbols like ◔ ◑ ◕ ● might be an alternative to ¼ ½ ¾. In which case can mean the 100th percentile, and replace ?

BenchmarkTools.Trial:
│  min 2.625 μs, median 2.838 μs, mean 2.838 μs, 99ᵗʰ 3.231 μs
│  0 allocations
│             ◔            ◑*                              ◕                 
│       ▇                           █▁    ▁                                   
│  ▁▁▂▂▇█▆▄▂▂▂▃▅▃▂▂▃▆▆▃▂▃▇█▄▂▂▄█▆▃▂▃██▃▂▃▇█▅▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▂▁▂▂▂▁▁▁▁▁▁▁ ▃
└  2.6 μs            10_000 samples, each 9 evaluations               3.3 μs ●

mcabbott avatar Sep 22 '21 15:09 mcabbott

You could do m/M for median an mu for the mean.

tecosaur avatar Sep 22 '21 15:09 tecosaur

You could do m/M for median an mu for the mean.

tecosaur avatar Sep 22 '21 15:09 tecosaur

jagged linear/log scale of rounding to 2 digits; have you generated synthetic benchmark results and see if you're happy with the centration of the histogram in all cases?

Good point. One consequence of this rounding is that it sometimes demands a minimum width of 10%, but not always. Made-up cases with 2 samples, differing by 1% -- perhaps these should look more similar?

Screenshot 2021-09-28 at 13 56 52

I wonder a bit whether the graph should have a minimum width in all cases, perhaps even 10%. Useful graphs tend to be much wider, so perhaps seeing bars clumped together would be a reminder not to try too hard to interpret noise.

mcabbott avatar Sep 28 '21 18:09 mcabbott

I like the idea of a minimum 10% width. I bet going to 3 sigfigs would also sidestep the problem pretty well. (Yes, it will always be possible to find cases that do weird things with any number of sigfigs, but even with 3 I bet benchmarking noise would typically nix the oddities.)

timholy avatar Sep 28 '21 18:09 timholy

Also you might want to test against #261

vchuravy avatar Sep 28 '21 21:09 vchuravy

~~Tests are missing but~~ it does have a special case for 0 & 1 samples.

Experimenting with ◔ ◑ ◕ to mark quartiles, in a way obvious enough to avoid needing a legend, making the first line cleaner?

Also, maybe the right way to lower-bound the width at roughly 10% is to demand that the limits are outside mean+-5% (or maybe 3%).

Screenshot 2021-09-30 at 10 33 29

mcabbott avatar Sep 28 '21 21:09 mcabbott

Sorry, I only just noticed this request. I am pretty swamped with teaching and probably can't use this much, but I checked out the branch and will start using it when the occasion arises. But especially given the delay, it's worth commenting that I like what I'm seeing above!

timholy avatar Oct 08 '21 08:10 timholy

For the next week or so I'm terribly busy, but I'll try to make some time to have another look at this within the next few weeks.

tecosaur avatar Nov 02 '21 06:11 tecosaur

@tecosaur bump?

vchuravy avatar Dec 05 '21 17:12 vchuravy

Righteo. I've finally had time to have a look at this (I seem to have a problem with serially underestimating my todo list 😅).

For starters, this is what I'm seeing, and what I'm going to base my comments off: image

A fair few of my comments below have been made before in one form or another, but don't seem to have been addressed. Hopefully, a little more discussion can help improve some of the finer points of this proposal 🙂.

Styalistic comments

Display width

The current display is 73 char wide. Your PR produces progress bars that are up to 87 chars wide. I think it is quite good to stick to <=80 chars, since that is a very well established default term colwidth. This is fine when the term that it is displayed on is <80 chars as you've added resizing (thanks!), but this doesn't consider copy/pasting etc. which does seem to happen with benchmark results. For that reason, I'd be inclined to set the maximum width to 80 chars.

Quantile and Mean indication

and * are used with no explanation, other than the link in colour. The connection between mean and * etc. should be explicit, and discernable without colour (this is potentially an accessibility concern). An easy fix for this is to show the symbols in parenthesis after the term.

Furthermore, I'd use μ over * for indicating the mean. It's more statistically appropriate, and we're already using non-ascii chars with . I don't see any reason not to use it.

image

Significant figures

I still think we may as well reduce the number of significant figures to 2-3 digits, for the reasons stated in earlier comments.

GC time colouring

At the moment, I find the styling inconsistent. In the "execution time" line, the mean component mean 32.806 ns is entirely green, but in the GC time line, the mean 2.080 ns component is uncoloured, but just the (6.34%) component is green.

Allocation colouring

Comparing a few benchmarks done with the current code and your PR, I find myself missing the ability to jump to the memory estimate by looking for yellow.

Current PR: image

Allocations / GC time ordering

I still find it strange to sandwich the allocations/memory information between the execution time and GC times. The units now go time-bytes-time instead of time-time-bytes and it gives the allocations line less visual priority than the GC times line, which feels like the reverse of what it should be to me.

Some wording

I think Min time ... GC mean time ... is better than min ... GC time: mean ....

Plot legend

A one-line plot legend should describe the essential nature of the depiction, not give a piece of contextual information. Hence, I'm not sure that having 10_000 samples, each 995 evaluations as the plot legend is great. IMO it makes more sense at the top describing the nature of the benchmark trial that has occurred.

Restyling the plot based on my comments

I've gone ahead and put all my suggestions together, to produce an A/B comparison for convenience 🙂.

The current PR: image

My changes: image

Code Comments

Repeted structures

The code could do with DRYing out. I see quite a few repeated elements with minor changes, i.e. opportunities for discrepancies to pop up in future if a change isn't mirrored.

Underscore prefixed variables

I can't see an obvious logic as to why some local variables are prefixes with an underscore and some aren't.

A few finer comments

I see some specific bits of the code that I think could be improved, however, they could be affected by acting on some of the suggestions I've made above, or addressed in a future commit. Since they're not major issues, I'll leave them be for now.

Comments

Just in case this slips by, but there seems to be a lot of inactive/commented out code lying around.

Commit log

A rebase is in order, but I imagine that goes without saying.


Sorry it's taken so long, but hope this helps!

tecosaur avatar Dec 29 '21 18:12 tecosaur

I think it is quite good to stick to <=80 chars,

Linus Torvalds on this: https://lkml.org/lkml/2020/5/29/1038. Limit of 80 characters was set when screens were 4:3 @ 720 × 480, I don't know anyone who programs these days have a 4:3 screen, certainly not the 99% of users of Julia, out of everything...

I hope this PR gets merged if there are no major issues

Moelf avatar Feb 12 '22 17:02 Moelf

My 2c on column widths: It's not about my screen width, because I don't tend to have code files or repls take up the entire width of my screen. Splitting into 2/3 columns is often preferable.

tecosaur avatar Feb 12 '22 17:02 tecosaur

With 2/3 columns this will happily adjust down. With a huge screen, you won't have to turn your head.

Sorry have not got around to typing a reply to all this. Tried above to seed threaded discussions on separate points of possible debate, to be more bite-sized.

mcabbott avatar Feb 12 '22 17:02 mcabbott

Adjusting down seems like a good solution to me, and is not something I tested thoroughly (if it works, I'm happy with that).

tecosaur avatar Feb 12 '22 17:02 tecosaur

My suggestion would be to stay comfortably <108 characters, because that is the most Github will show, even though it may show much less (as few as 50 with side-by-side view):

00000000001111111111222222222233333333334444444444555555555566666666667777777777888888888899999999991234567890

vtjnash avatar Feb 12 '22 17:02 vtjnash

https://github.com/JuliaCI/BenchmarkTools.jl/pull/260#discussion_r720306154 is the explicit demonstration, in 55 columns. Care has been taken that line wrapping will be minimally disruptive on very narrow screens, like pasting into slack, even if initially run on a wider terminal. As explained there, it never expands beyond about 90, for which my reference is perhaps https://github.com/invenia/BlueStyle#synopsis (but, edit, thanks for the github 108 number).

mcabbott avatar Feb 12 '22 17:02 mcabbott

Great! 😀

tecosaur avatar Feb 12 '22 17:02 tecosaur

By the way, I took the current state of this PR and my modifications to some third parties to get some wider feedback: image

I asked which was preferred, and the four responses I got were:

  • B
  • B?
  • B, except I like “10_000 samples, each 995 evaluations” better as axis label than subtitle
  • I like B more, but I prefer “GC time: mean …, max …” over “GC mean time …, max …”

I also got an additional comment:

  • I appreciate the commitment to unicode but boy howdy does composing the th suffix out of individual monospaced ᵗʰ superscript characters not look great

So, I'm inclined to think the best course would be this PR + my modifications +

  • changing 99ᵗʰ to 99% (or 99th)
  • changing "GC mean time…" to "GC time: mean…"
  • and I'm no longer so bullish on moving the sample info out of the axis label, though my personal preference remains moving it up top

tecosaur avatar Feb 12 '22 18:02 tecosaur

My complaint remains that looking at one screenshot is pretty misleading. I think you need to use the hammer to find out what's good about it, not look at the picture in the catalogue. But failing that, I tried to show off various aspects in a variety of screenshots above.

Besides the scaling issue, what you miss is that the GC time is not printed when there is no GC time. Partly so as not to print "non-data ink", and partly to sharpen the difference between allocating and non-allocating versions.

And more... when I have time.

mcabbott avatar Feb 12 '22 18:02 mcabbott

There are indeed variants that look slightly different, however I don't think a barrage of screenshots with slight differences is actually that helpful, so I went with a "maximal" (showing all elements) version. Particularly as the hiding behaviour remains unaffected — I've just tweaked order and styling (which is most apparent in the "maximal" version, that I used), I think this is a valid comparison for the changes I have made.

tecosaur avatar Feb 12 '22 18:02 tecosaur

bump, I still really want this to happen

Moelf avatar May 06 '22 04:05 Moelf

What needs to be done to resolve this? Is anyone waiting on a review or does it just lack for some free time to finish it off?

If it's quibbling about details and everyone agrees this is an improvement over the status quo, I'd propose it's time to get this merged and then we can polish further in future PRs.

timholy avatar Jun 26 '22 13:06 timholy

There are currently merge conflicts, and unresolved discussion points. I do have the code for my (B) compromise/blend version from https://github.com/JuliaCI/BenchmarkTools.jl/pull/260#issuecomment-1037355070 which perhaps we'd all be happy with as an improvement on the current?

image

tecosaur avatar Jun 26 '22 15:06 tecosaur

What needs to be done to resolve this? Is anyone waiting on a review or does it just lack for some free time to finish it off?

Sorry I have not got around to writing up a reply to review points. I do think most have sensible answers, will try soon.

mcabbott avatar Jun 27 '22 03:06 mcabbott

still want this, bump

Moelf avatar Oct 23 '22 02:10 Moelf