Documentation: Part I
Improve README with usage notes
Thanks! Now I can see much better what the changes are.
Some thoughts that come to my mind when reading this:
- Right now we have two places where each tool is mentioned/described: a quick overview at the top of doc/INDEX.md and the documentation of each tool itself. This PR adds a third place (README.md). And actually we have kind of yet another place if we count the "features" list in the README, which, although not explicitly mapped to the tools, does correspond to them. I wonder whether that becomes too much. I agree that we need a much better overview description than what we currently have in INDEX.md, but maybe it makes more sense to just extend that part (also with a figure) instead of adding more to README.md? Would you have found it at the top of the documentation when you started trying to understand BenchExec? Should we reference each tool from the feature (in the feature list) that it provides?
- The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
- I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
- The part about system requirements is now duplicated in README.md. Is this intended?
What do you think about this? I would highly appreciate hearing your thoughts about these questions, because it is not easy to consider what is the best reading flow for new people.
Adding to README
The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
I'd rather remove it from the other places. Without runexec being mentioned there, I wouldn't look further. From personal experience: I did briefly look at BenchExec several years ago during my PhD and thought BenchExec is just benchexec, which looked way too complicated for what I wanted to do (run ~10 runs of a single tool). runexec I would have definitely appreciated and used.
I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
I would argue that (initially) benchexec is far less interesting than runexec for most users. Setting up benchexec for a single tool evaluation likely is perceived as too cumbersome, while using runexec as a drop-in replacement for taking better measurements has a very low entry barrier.
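For illustration, a minimal sketch of what that drop-in replacement could look like on the command line (the tool name and input file are placeholders; see doc/runexec.md for the full set of options):

```sh
# measuring a single tool run with GNU time, as many evaluations do today
/usr/bin/time -v ./mytool input.file

# the same run under runexec: measurements are printed as key=value lines,
# and the tool's own output goes to output.log by default
runexec -- ./mytool input.file
```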
In my perception, benchexec is the "advanced" version, which mainly caters towards people setting up a competition / large evaluation, while runexec is interesting for "everyone" (at least based on my personal experience of telling a few people about BenchExec).
The part about system requirements is now duplicated in README.md. Is this intended?
Yes, IMO the README should briefly summarize what BenchExec offers and what it needs so that one can make a judgement. If you have to click through several files just to see what all the main features and requirements are, I am rather certain that people will simply not care enough. After all, there is no "pressure" to use this; /usr/bin/time still is the "gold standard" for evaluations :-(
Adding to README
The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
I'd rather remove it from the other places. Without runexec being mentioned there, I wouldn't look further. From personal experience: I did briefly look at BenchExec several years ago during my PhD and thought BenchExec is just benchexec, which looked way too complicated for what I wanted to do (run ~10 runs of a single tool). runexec I would have definitely appreciated and used.
I understand, and we will follow this advice. Thank you!
What do you think about the following as an extension of the current feature list?
BenchExec provides three major features:
- execution of arbitrary commands with precise and reliable measurement and limitation of resource usage (e.g., CPU time and memory), and isolation against other running processes (This is provided by runexec, cf. doc/runexec.md.)
- an easy way to define benchmarks with specific tool configurations and resource limits, and automatically executing them on large sets of input files (This is provided by benchexec, cf. doc/benchexec.md, on top of runexec.)
- generation of interactive tables and plots for the results (This is provided by table-generator, cf. doc/table-generator.md, for results produced with benchexec.)
For the "on top of" part I am unsure what the best wording is.
But overall I would hope that this makes it pretty clear that if you want only the first feature, you can have just that, and how to get it. Or doesn't it?
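For illustration, a rough command-line sketch of how the three tools from the list above might fit together (the file names and the result-file glob are placeholders, not taken from the documentation):

```sh
# 1. measure and limit a single command (runexec)
runexec -- ./mytool input.file

# 2. run a whole benchmark defined in an XML file (benchexec, built on top of runexec)
benchexec mybenchmark.xml

# 3. turn the produced result files into interactive HTML tables and plots
table-generator results/*.results*.xml.bz2
```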
I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
I would argue that (initially) benchexec is far less interesting than runexec for most users. Setting up benchexec for a single tool evaluation likely is perceived as too cumbersome, while using runexec as a drop-in replacement for taking better measurements has a very low entry barrier.
I am not so sure. If you start with benchmarking (e.g., as a new researcher), and you do not have all the scripts already that allow you to run sets of commands, collect the results, compare the results across different configurations etc., benchexec should be much more helpful than runexec.
In my perception, benchexec is the "advanced" version, which mainly caters towards people setting up a competition / large evaluation, while runexec is interesting for "everyone" (at least based on my personal experience of telling a few people about BenchExec).
For me it is important that we do not imply this in the documentation. We do not want to scare people away from benchexec. We want to make clear that it is actually not much effort to use benchexec and you often need only two files, and it gives you a lot of features that would be complex and error-prone to implement in own scripts. benchexec is not at all intended or targeted only for large evaluations or competitions. Even if you have just 10 runs and want the results in a CSV file I want benchexec to sound like an interesting choice. So if according to your experience the opposite is happening, then it seems we need to improve the documentation in that regard in order to reverse this and make benchexec more clear, not advertise runexec more strongly.
The part about system requirements is now duplicated in README.md. Is this intended?
Yes, IMO the README should briefly summarize what BenchExec offers and what it needs so that one can make a judgement.
I meant to say that in this PR the system requirements appear twice inside README.md.
What do you think about the following as an extension of the current feature list?
That sounds good!
I meant to say that in this PR the system requirements appear twice inside README.md.
oops that was a mistake :-)
I am not so sure. If you start with benchmarking (e.g., as a new researcher), and you do not have all the scripts already that allow you to run sets of commands, collect the results, compare the results across different configurations etc., benchexec should be much more helpful than runexec.
For me it is important that we do not imply this in the documentation. We do not want to scare people away from benchexec. We want to make clear that it is actually not much effort to use benchexec and you often need only two files, and it gives you a lot of features that would be complex and error-prone to implement in own scripts. benchexec is not at all intended or targeted only for large evaluations or competitions. Even if you have just 10 runs and want the results in a CSV file I want benchexec to sound like an interesting choice. So if according to your experience the opposite is happening, then it seems we need to improve the documentation in that regard in order to reverse this and make benchexec more clear, not advertise runexec more strongly.
I think the target audience you have in mind is different from what I think of :-) A notable fraction of people in my area barely know one programming language and "writing an XML file" or "write a python class implementing an interface" is a significant task. The implementations often aren't stable tools but prototypes. And yes, if these are evaluated on their own, there isn't much reason for precise measurements, but when comparing to existing tools, I think this is relevant.
I would argue that using benchexec over runexec is quite a bit of mental load. In my experience, several evaluations aren't even using scripts but just run the few invocations manually. Then, just writing runexec command is much easier than writing definitions etc. Similarly, for most evaluations I have seen, the list of invocations isn't a nice cross product of options and models but specific options for each model. While all this is possible with benchexec, it is some additional work.
Just for comparison: Using benchexec will need a tool-info module (which requires knowing basic Python + how invocations work + reading up / testing how things are passed around) + an XML file for the definitions (which is not really a nice experience due to XML) + understanding the terminology. This is a lot of effort for someone not well versed in programming. I would bet that the tool-info alone is enough to scare away most of the people I would want to convince of trying benchexec.
My main point is that currently there is no real reward for using BenchExec except "doing things right", which I believe is not enough to justify spending more than an hour on this for most PhD students, or, in any case, something that quite a few supervisors won't appreciate. (Bluntly: Using benchexec over time won't have any influence on the score of any paper in my community.) As such, I would find it nice if the README evokes the feeling of "if you want to measure properly, this really is not much effort and you don't have to rethink how you do evaluations at all, because there is a drop-in replacement, so no reason not to try it" + "if you want more stuff, there is more".
It is a question of whom you want to reach - I think it is totally fine if you want to cater towards an "experienced" crowd but for a beginner I am rather certain that benchexec inherently is "scary" (even though it is easy to use once you know how!)
Sorry for the long delay. It was a busy time.
What do you think about the following as an extension of the current feature list?
That sounds good!
Thanks, I will commit this. Step-wise improvements :-) Would you like to be credited as a co-author of the commit?
It is a question of whom you want to reach - I think it is totally fine if you want to cater towards an "experienced" crowd
Not at all! Beginners have the most to gain from using BenchExec, and beginners are the most important people who should use BenchExec (instead of taking care of all the tricky details of benchmarking themselves).
but for a beginner I am rather certain that benchexec inherently is "scary" (even though it is easy to use once you know how!)
I know, of course, that there is some initial barrier for using BenchExec because you need to learn several things. And I am always glad about feedback and hints about how we could improve this by making it easier or having more documentation.
But I think that pushing people and especially beginners who know little about benchmarking into the direction of using less of BenchExec and more of their own hand-written scripts or manual steps is going in the wrong direction.
I think the target audience you have in mind is different from what I think of :-) A notable fraction of people in my area barely know one programming language and "writing an XML file" or "write a python class implementing an interface" is a significant task. The implementations often aren't stable tools but prototypes.
But this group of people would also struggle with having to write their own benchmarking scripts. So precisely for this group of people I would argue that using benchexec to take care of a lot of things with regard to benchmarking is important and useful.
Just for comparison: Using benchexec will need a tool-info module (which requires knowing basic Python + how invocations work + reading up / testing how things are passed around) + an XML file for the definitions (which is not really a nice experience due to XML) + understanding the terminology. This is a lot of effort for someone not well versed in programming. I would bet that the tool-info alone is enough to scare away most of the people I would want to convince of trying benchexec.
Even if they compare it against the effort of writing their own script that takes care of collecting all the benchmark runs and storing the results in a machine-readable way?
My main point is that currently there is no real reward for using BenchExec except "doing things right", which I believe is not enough to justify spending more than an hour on this for most PhD students, or, in any case, something that quite a few supervisors won't appreciate.
If you just replace time with runexec, then there are fields where there is no real reward for using BenchExec except "doing things right" (because reviewers do not care about data quality, reproducibility, etc.), I agree. But if you do not use runexec but benchexec, you do get a lot of immediate benefits such as not having to write scripts (or at least fewer scripts), nice tables for browsing through the results, the possibility to spontaneously create a scatter plot during a meeting with your supervisor with just a few clicks, etc.
So no, I do not want to force people to use benchexec and I do not want to hide runexec (and I believe we are not doing so), but I do want to recommend benchexec for most users instead of runexec.
(Bluntly: Using benchexec over time won't have any influence on the score of any paper in my community.) As such, I would find it nice if the README evokes the feeling of "if you want to measure properly, this really is not much effort and you don't have to rethink how you do evaluations at all, because there is a drop-in replacement, so no reason not to try it" + "if you want more stuff, there is more".
I see the point for those users who already have such scripts and I agree that for those users your suggestion is a good selling strategy. Thank you! I think we can incorporate this without pushing everyone to runexec first.
In order to lock in place what we have already and allow easier iteration, I have committed an attempt at this together with what was discussed before in 48cfd9bae81f0df719b9119adc5d33e121381c04. It would be nice if you could have a look whether you like it or have suggestions, and whether you agree with being a co-author. Then I would push this to the main branch.
Afterwards, I would be glad to hear what you think is still missing between that commit and this PR and what the goals of the remaining changes are.
Sorry for the long delay. It was a busy time.
Can relate, no problem :)
Would you like to be credited as a co-author of the commit?
No need, maybe when I contribute a larger chunk. But I also won't object :)
But this group of people would also struggle with having to write their own benchmarking scripts.
But they often are not aware that their scripts are a problem ;)
So I would see it as a compromise - make it easy to use the basic bits so at least that is done right. I agree that many more people should be using benchexec, which is exactly why I think there should be as little of a barrier to do the first steps as possible.
Even if they compare it against the effort of writing their own script that takes care of collecting all the benchmark runs and storing the results in a machine-readable way?
If only that were the case. Examples from a recent artifact evaluation: 1) needing to manually modify the source code to change the model being run, 2) an IDE packaged inside a VM where you needed to change the run parameters inside the VM (i.e., no usable command-line binary). Yes, these are extreme examples, but this is the sort of context that I want to make proper benchmarking appealing to. Most artifacts do not store results in a machine-readable way.
(because reviewers do not care about data quality, reproducibility, etc.),
But I care about this! :) I see where we are going. My point here is: I want to be able to establish using runexec or similar as a minimal baseline for proper benchmarks. I regularly criticize papers because they are lacking basic documentation on how they run their experiments. And if there is a clean and simple "here is how to replace /usr/bin/time with something reasonable" README, then I can point to that. Yes, benchexec is a benefit for the authors, but especially runexec is a benefit for reviewers and the community because of consistency.
Again, I agree that benchexec often is better than runexec, but it's easier to get started with runexec as a drop-in replacement for measurements (especially because it actually also does resource limiting!).
I see the point for those users who already have such scripts and I agree that for those users your suggestion is a good selling strategy. Thank you! I think we can incorporate this without pushing everyone to runexec first.
Great! I like the changes in the commit. I think the only addition I would make is that runexec also takes care of CPU and memory limits (compared to time, which only measures).
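For illustration, the limiting aspect might be shown with something like the following sketch (the limit values are placeholders, and the exact option syntax should be checked against doc/runexec.md, since it has varied between versions):

```sh
# unlike time, runexec can also enforce limits on the run:
# here, kill the tool after 900 s of CPU time or 8 GB of memory
runexec --timelimit 900s --memlimit 8GB -- ./mytool input.file
```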
Great! I like the changes in the commit. I think the only addition I would make is that runexec also takes care of CPU and memory limits (compared to time, which only measures).
Thanks! Added and pushed to main.
My point here is: I want to be able to establish using runexec or similar as a minimal baseline for proper benchmarks.
Thank you very much! We highly appreciate this (and not because it pushes our tool in particular, but because of the general improvement to the state of how research is done).
I regularly criticize papers because they are lacking basic documentation on how they run their experiments. And if there is a clean and simple "here is how to replace /usr/bin/time with something reasonable" README, then I can point to that.
Oh, yes! This is a great suggestion and I definitely want to provide this now!
Do you think extending https://github.com/sosy-lab/benchexec/blob/main/doc/runexec.md would be a good place?
(and not because it pushes our tool in particular, but because of the general improvement to the state of how research is done).
Same here :)
Do you think extending https://github.com/sosy-lab/benchexec/blob/main/doc/runexec.md would be a good place?
I am unsure. There are two separate points: 1. "this is what the tool can do" and 2. "this is how to do basic proper benchmarks". I think runexec.md should be more of a "documentation" of the tool, while the other should be a self-contained set of "instructions". I imagine something like this:
# Proper Measurements with Benchexec 101
## Guided Example
Establish small running example
### Step 1 Install benchexec
Run commands xy
-> Link to further details and troubleshooting
### Step 2 Consider basic setup
Important points: Time limit, resource limits, how to handle timeouts
-> Link to further information on benchmark design
### Step 3 Run tool with runexec
Concrete invocation and example output
-> Link to runexec doc
## Further resources
Integrate with your benchmarking script -> link to API
Automate executions and result gathering -> link to benchexec
Combining with docker -> link to notes on docker
etc.
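To make Step 1 of the sketch a bit more tangible, it could show something like the following (this assumes a pip-based installation is acceptable; BenchExec also provides distribution packages, and the cgroup check is based on the installation documentation and may differ by version):

```sh
# Step 1: install BenchExec for the current user
pip3 install --user benchexec

# optional sanity check whether the kernel/cgroup setup allows
# reliable measurements (see the installation documentation)
python3 -m benchexec.check_cgroups
```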
Basically, I would like to be able to write on the webpage of an artifact evaluation "We strongly encourage authors to follow established best practices for benchmarking [cite to your paper]. See here [link to the above page] for a guide on how to do it." (or something like this) and similarly just paste this in a review. (I think AE is probably the best way to raise awareness as the reviewers can actually inspect the methodology.) Point being, I think it would be nice if this is a "landing page" that can be directly linked to, if you see what I mean?
I see, and I fully agree that this would be great to have. We could call it "Quick start guide" or "Runexec Tutorial" or so? It should then probably be its own page in doc/.
Btw., also have a look at https://github.com/sosy-lab/benchexec/blob/main/doc/benchmarking.md. It also covers "stuff that you should do when benchmarking" and should be tightly coupled via links to what you are proposing, but probably keeping this checklist and the technical quick-start guide on separate pages is better.
I like "quickstart.md" a lot (its not exclusive about runexec
but rather concrete steps to improve benchmarking)
Yes, I saw benchmarking.md. I think it would be a good idea to condense the most important points / a set of "commandments" into the quickstart and then link to that with something like "for further information, see here".
I could sketch a quickstart if you want (but it will need some polishing / iteration I think)
I could sketch a quickstart if you want (but it will need some polishing / iteration I think)
I would be glad about this! Thank you a lot for your invaluable help!
Will do soon (tm); closing this MR then