Make RustPython benchmarks readable
TL;DR: the benchmarks are hard to read and could be greatly improved. They are a key element in convincing people of the soundness of RustPython, so they should probably not be neglected IMHO.
The violin plots available here are not easy to read, and their Y-axis labels are barely legible because they got cut off on the left at some point. This is especially troublesome for the MICROBENCHMARKS section, where it is impossible to tell RustPython from CPython.
This issue could be alleviated by doing the following:
- Use a specific color for CPython and another one for RustPython (and keep this color pair consistent across all plots).
- Always have CPython data on top and RustPython data on bottom (this is not consistent: in the `EXECUTION` tab, CPython is on top and RustPython on bottom, while in the `PARSE_TO_AST` tab it is the other way around).
- Only keep the name of the benchmark in the Y-axis labels, i.e. replace `execution/mandelbrot.py/cpython` by either `Mandelbrot` (and use a legend to indicate which color is which interpreter), or make a plot title saying `Mandelbrot` and use the Y-axis labels to tell whether it is CPython or RustPython.
In addition to these visual issues, some other improvements could be implemented:
- Make the plots user-friendly using some interactive backend such as `plotly`.
- Put hyperlinks to the benchmark script location / source code, so that users can check what the benchmarks are actually doing.
- Along the same lines, add a short descriptive text about what each benchmark does / why it is relevant (for instance "benchmark X is particularly I/O intensive" or the like).
- At the top of the page, give the commit hash / version (possibly with the release date, to tell at a glance whether they are outdated) of both the CPython and RustPython binaries that were used, whether they were recompiled locally with `-O3`, as well as the machine specs (this would allow for meaningful comparison and reproducibility); see the sketch after this list.
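As a rough illustration of that last point, here is a minimal sketch of a helper that could gather such provenance metadata in CI and print it as Markdown for the benchmarks page. This is hypothetical code, not part of the current workflow; the `python3` and `rustpython` command names are assumptions about what is on the PATH.

```rust
// Hypothetical helper, not part of the current CI: gather the provenance
// information suggested above and print it as a Markdown block that the
// benchmarks page could embed.
use std::process::Command;

/// Run a command and return its trimmed stdout, or "unknown" on failure.
fn capture(cmd: &str, args: &[&str]) -> String {
    Command::new(cmd)
        .args(args)
        .output()
        .ok()
        .and_then(|o| String::from_utf8(o.stdout).ok())
        .map(|s| s.trim().to_string())
        .unwrap_or_else(|| "unknown".to_string())
}

fn main() {
    // Command names are assumptions about what is available in the CI image.
    let cpython = capture("python3", &["--version"]);
    let rustpython = capture("rustpython", &["--version"]);
    let commit = capture("git", &["rev-parse", "--short", "HEAD"]);
    let machine = capture("uname", &["-a"]);

    println!("## Benchmark environment");
    println!("- RustPython commit: `{commit}`");
    println!("- RustPython: `{rustpython}`");
    println!("- CPython: `{cpython}`");
    println!("- Machine: `{machine}`");
}
```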
I think that benchmarks are one of the key elements that might convince someone to switch from one interpreter to another (apart from functionality / low-level bindings). Hence they should not be neglected.
If someone could point me to where these plots are generated, I'd be happy to help typeset them / add further info (although I might need some technical support on why benchmark X is especially relevant or not).
Thanks for bringing this up. These issues do need to be addressed. Currently our benchmark graphs are handled by Criterion: https://bheisler.github.io/criterion.rs/book/user_guide/comparing_functions.html
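For reference, here is a minimal sketch of what such a comparison group looks like with Criterion. The script path and the subprocess invocation are assumptions for illustration only, not the actual harness:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use std::process::Command;

// Illustrative only: one group per script, one function per interpreter.
// With this layout the IDs come out as `execution/mandelbrot.py/cpython`
// etc., which is what drives the paired plots in Criterion's HTML report.
fn bench_mandelbrot(c: &mut Criterion) {
    let mut group = c.benchmark_group("execution/mandelbrot.py");
    for interpreter in ["cpython", "rustpython"] {
        group.bench_function(BenchmarkId::from_parameter(interpreter), |b| {
            b.iter(|| {
                // Assumed subprocess invocation for illustration; the actual
                // harness works differently.
                Command::new(if interpreter == "cpython" { "python3" } else { "rustpython" })
                    .arg("benches/benchmarks/mandelbrot.py")
                    .status()
                    .expect("failed to run interpreter")
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_mandelbrot);
criterion_main!(benches);
```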
Sadly, from what I know, Criterion doesn't support configuring its graphs.
The workflow that updates these benchmarks is here: https://github.com/RustPython/RustPython/blob/c97f4d1daf02fe08d73f8571787377ca28a0cdaa/.github/workflows/cron-ci.yaml#L111
Now everything else (display-wise) is located here: https://github.com/RustPython/rustpython.github.io/blob/master/_layouts/benchmarks.html
I think the last 3 bullet points should be very feasible to implement. The graph-related ones could either be done during CI via a custom script or on the frontend with a JS plotting library.
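As a sketch of the "custom script during CI" option: something like the following could walk Criterion's default `target/criterion/<group>/<benchmark>/new/estimates.json` output and emit a single JSON document for a frontend plotting library. The directory layout and field names follow Criterion's current output and are assumptions that may change between versions.

```rust
// Sketch of a possible CI post-processing step (not existing code): walk
// Criterion's output directory and collect mean estimates into one JSON
// document that a frontend plotting library could consume.
use serde_json::Value;
use std::{collections::BTreeMap, fs, path::Path};

fn collect(dir: &Path, prefix: String, out: &mut BTreeMap<String, f64>) -> std::io::Result<()> {
    // Assumes Criterion's default `<benchmark>/new/estimates.json` layout.
    let estimates = dir.join("new").join("estimates.json");
    if estimates.is_file() {
        if let Ok(v) = serde_json::from_str::<Value>(&fs::read_to_string(&estimates)?) {
            // Mean point estimate (nanoseconds in Criterion's output).
            if let Some(mean) = v["mean"]["point_estimate"].as_f64() {
                out.insert(prefix.clone(), mean);
            }
        }
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            let name = path.file_name().unwrap().to_string_lossy().into_owned();
            let child = if prefix.is_empty() { name } else { format!("{prefix}/{name}") };
            collect(&path, child, out)?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut results = BTreeMap::new();
    collect(Path::new("target/criterion"), String::new(), &mut results)?;
    println!("{}", serde_json::to_string_pretty(&results).unwrap());
    Ok(())
}
```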
I don't think anyone needs to be convinced of anything; in my opinion, you use the right tool depending on the project. Maybe it would be a good idea to ask if RustPython should be included in various distribution tools like pyenv or similar.
In any case, I can try to understand and solve this issue.
From what I could understand, Criterion sets the violin plot color in criterion.rs/src/plot/gnuplot_backend/summary.rs::fn violin(), via the constant DARK_BLUE.
I'm not experienced enough in Rust to confirm whether this constant can be overridden at compile time depending on whether "RustPython" or "CPython" is being benchmarked.
Even when switching Criterion's plotting backend, this parameter doesn't seem to be exposed, not to mention criterion-plot (an unmaintained library).
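For completeness, the backend switch itself is just a one-line configuration change. This is a minimal sketch with a placeholder benchmark, and as far as I can tell it gives no control over the violin colors:

```rust
// Minimal sketch: selecting the pure-Rust `plotters` backend instead of
// gnuplot. The `dummy` benchmark is only a placeholder; as noted above,
// neither backend appears to expose per-function colors.
use criterion::{black_box, criterion_group, criterion_main, Criterion, PlottingBackend};

fn bench_dummy(c: &mut Criterion) {
    c.bench_function("dummy", |b| b.iter(|| black_box(1 + 1)));
}

criterion_group! {
    name = benches;
    config = Criterion::default().plotting_backend(PlottingBackend::Plotters);
    targets = bench_dummy
}
criterion_main!(benches);
```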
One method to change the color scheme appears to be putting a Criterion.toml in the project's root, but it still doesn't modify the violin plots, which are far too dense to be distinguishable:
```toml
[colors]
# These are used in many charts to compare the current measurement against
# the previous one.
current_sample = {r = 31, g = 120, b = 180}
previous_sample = {r = 7, g = 26, b = 28}

# These are used by the full PDF chart to highlight which samples were outliers.
not_an_outlier = {r = 31, g = 120, b = 180}
mild_outlier = {r = 5, g = 127, b = 0}
severe_outlier = {r = 7, g = 26, b = 28}

# These are used for the line chart to compare multiple different functions.
comparison_colors = [
    {r = 230, g = 25, b = 75},   # red
    {r = 60, g = 180, b = 75},   # green
    {r = 255, g = 225, b = 25},  # yellow
    {r = 0, g = 130, b = 200},   # blue
    {r = 245, g = 130, b = 48},  # orange
    {r = 145, g = 30, b = 180},  # purple
    {r = 70, g = 240, b = 240},  # light cyan
    {r = 240, g = 50, b = 230},  # magenta
]
```
The solution should be to find a library that can replace Criterion as quickly as possible while still producing a similar HTML report.