Documentation: Part I
Improve README with usage notes
Thanks! Now I can see much better what the changes are.
Some thoughts that come to my mind when reading this:
- Right now we have two places where each tool is mentioned/described: a quick overview at the top of doc/INDEX.md and the documentation of each tool itself. This PR adds a third place (README.md). And actually we have kind of yet another place if we count the "features" list in the README, which, although not explicitly mapped to the tools, does correspond to them. I wonder whether that becomes too much. I agree that we need a much better overview description than what we currently have in INDEX.md, but maybe it makes more sense to just extend that part (also with a figure) instead of adding more to README.md? Would you have found it at the top of the documentation when you started trying to understand BenchExec? Should we reference each tool from the feature (in the feature list) that it provides?
- The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
- I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
- The part about system requirements is now duplicated in README.md. Is this intended?
What do you think about this? I would highly appreciate hearing your thoughts about these questions, because it is not easy to consider what is the best reading flow for new people.
Adding to README
The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
I'd rather remove it from the other places. Without runexec being mentioned there, I wouldn't look further. From personal experience: I did briefly look at BenchExec several years ago during my PhD and thought BenchExec is just benchexec, which looked way too complicated for what I wanted to do (run ~10 runs of a single tool). runexec I would have definitely appreciated and used.
I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
I would argue that (initially) benchexec is far less interesting than runexec for most users. Setting up benchexec for a single tool evaluation likely is perceived as too cumbersome, while using runexec as a drop-in replacement for taking better measurements has a very low entry barrier.
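For illustration, a minimal sketch of what that drop-in replacement could look like on the command line (the tool name and input file are placeholders; see doc/runexec.md for the full set of options):

```sh
# measuring a single tool run with GNU time, as many evaluations do today
/usr/bin/time -v ./mytool input.file

# the same run under runexec: measurements are printed as key=value lines,
# and the tool's own output goes to output.log by default
runexec -- ./mytool input.file
```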
In my perception, benchexec is the "advanced" version, which mainly caters towards people setting up a competition / large evaluation, while runexec is interesting for "everyone" (at least based on my personal experience of telling a few people about BenchExec).
The part about system requirements is now duplicated in README.md. Is this intended?
Yes, IMO the README should briefly summarize what BenchExec offers and what it needs so that one can make a judgement. If you have to click through several files just to see what all the main features and requirements are, I am rather certain that people will simply not care enough. After all, there is no "pressure" to use this; /usr/bin/time still is the "gold standard" for evaluations :-(
Adding to README
The new text starts with "BenchExec is more than just the tool benchexec", but at this point the reader has never heard of benchexec so far. We would need to start differently.
I'd rather remove it from the other places. Without runexec being mentioned there, I wouldn't look further. From personal experience: I did briefly look at BenchExec several years ago during my PhD and thought BenchExec is just benchexec, which looked way too complicated for what I wanted to do (run ~10 runs of a single tool). runexec I would have definitely appreciated and used.
I understand, and we will follow this advice. Thank you!
What do you think about the following as an extension of the current feature list?
BenchExec provides three major features:
- execution of arbitrary commands with precise and reliable measurement and limitation of resource usage (e.g., CPU time and memory), and isolation against other running processes (This is provided by runexec, cf. doc/runexec.md.)
- an easy way to define benchmarks with specific tool configurations and resource limits, and automatically executing them on large sets of input files (This is provided by benchexec, cf. doc/benchexec.md, on top of runexec.)
- generation of interactive tables and plots for the results (This is provided by table-generator, cf. doc/table-generator.md, for results produced with benchexec.)
For the "on top of" part I am unsure what the best wording is.
But overall I would hope that this makes it pretty clear that if you want only the first feature, you can have just that, and how to get it. Or doesn't it?
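For illustration, a rough command-line sketch of how the three tools from the list above might fit together (the file names and the result-file glob are placeholders, not taken from the documentation):

```sh
# 1. measure and limit a single command (runexec)
runexec -- ./mytool input.file

# 2. run a whole benchmark defined in an XML file (benchexec, built on top of runexec)
benchexec mybenchmark.xml

# 3. turn the produced result files into interactive HTML tables and plots
table-generator results/*.results*.xml.bz2
```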
I prefer starting with benchexec: It is the main tool, and it provides the most features to users, so it is the one that we recommend for users. runexec is basically intended for those users who cannot use benchexec.
I would argue that (initially) benchexec is far less interesting than runexec for most users. Setting up benchexec for a single tool evaluation likely is perceived as too cumbersome, while using runexec as a drop-in replacement for taking better measurements has a very low entry barrier.
I am not so sure. If you start with benchmarking (e.g., as a new researcher), and you do not have all the scripts already that allow you to run sets of commands, collect the results, compare the results across different configurations etc., benchexec should be much more helpful than runexec.
In my perception, benchexec is the "advanced" version, which mainly caters towards people setting up a competition / large evaluation, while runexec is interesting for "everyone" (at least based on my personal experience of telling a few people about BenchExec).
For me it is important that we do not imply this in the documentation. We do not want to scare people away from benchexec. We want to make clear that it is actually not much effort to use benchexec and you often need only two files, and it gives you a lot of features that would be complex and error-prone to implement in own scripts. benchexec is not at all intended or targeted only for large evaluations or competitions. Even if you have just 10 runs and want the results in a CSV file I want benchexec to sound like an interesting choice. So if according to your experience the opposite is happening, then it seems we need to improve the documentation in that regard in order to reverse this and make benchexec more clear, not advertise runexec more strongly.
The part about system requirements is now duplicated in README.md. Is this intended?
Yes, IMO the README should briefly summarize what BenchExec offers and what it needs so that one can make a judgement.
I meant to say that in this PR the system requirements appear twice inside README.md.
What do you think about the following as an extension of the current feature list?
That sounds good!
I meant to say that in this PR the system requirements appear twice inside README.md.
oops that was a mistake :-)
I am not so sure. If you start with benchmarking (e.g., as a new researcher), and you do not have all the scripts already that allow you to run sets of commands, collect the results, compare the results across different configurations etc., benchexec should be much more helpful than runexec.
For me it is important that we do not imply this in the documentation. We do not want to scare people away from benchexec. We want to make clear that it is actually not much effort to use benchexec and you often need only two files, and it gives you a lot of features that would be complex and error-prone to implement in own scripts. benchexec is not at all intended or targeted only for large evaluations or competitions. Even if you have just 10 runs and want the results in a CSV file I want benchexec to sound like an interesting choice. So if according to your experience the opposite is happening, then it seems we need to improve the documentation in that regard in order to reverse this and make benchexec more clear, not advertise runexec more strongly.
I think the target audience you have in mind is different from what I think of :-) A notable fraction of people in my area barely know one programming language and "writing an XML file" or "write a python class implementing an interface" is a significant task. The implementations often aren't stable tools but prototypes. And yes, if these are evaluated on their own, there isn't much reason for precise measurements, but when comparing to existing tools, I think this is relevant.
I would argue that using benchexec over runexec is quite a bit of mental load. In my experience, several evaluations aren't even using scripts but just run the few invocations manually. Then, just writing runexec command is much easier than writing definitions etc. Similarly, for most evaluations I have seen, the list of invocations isn't a nice cross product of options and models but specific options for each model. While all this is possible with benchexec, it is some additional work.
Just for comparison: Using benchexec will need a tool-info module (which requires knowing basic Python + how invocations work + reading up / testing how things are passed around) + an XML file for the definitions (which is not really a nice experience due to XML) + understanding the terminology. This is a lot of effort for someone not well versed in programming. I would bet that the tool-info alone is enough to scare away most of the people I would want to convince of trying benchexec.
My main point is that currently there is no real reward for using BenchExec except "doing things right", which I believe is not enough to justify spending more than an hour on this for most PhD students, or, in any case, something that quite a few supervisors won't appreciate. (Bluntly: Using benchexec over time won't have any influence on the score of any paper in my community.) As such, I would find it nice if the README evokes the feeling of "if you want to measure properly, this really is not much effort and you don't have to rethink how you do evaluations at all, because there is a drop-in replacement, so no reason not to try it" + "if you want more stuff, there is more".
It is a question of whom you want to reach - I think it is totally fine if you want to cater towards an "experienced" crowd but for a beginner I am rather certain that benchexec inherently is "scary" (even though it is easy to use once you know how!)
Sorry for the long delay. It was a busy time.
What do you think about the following as an extension of the current feature list?
That sounds good!
Thanks, I will commit this. Step-wise improvements :-) Would you like to be credited as a co-author of the commit?
It is a question of whom you want to reach - I think it is totally fine if you want to cater towards an "experienced" crowd
Not at all! Beginners have the most to gain from using BenchExec, and beginners are the most important people who should use BenchExec (instead of taking care of all the tricky details of benchmarking themselves).
but for a beginner I am rather certain that benchexec inherently is "scary" (even though it is easy to use once you know how!)
I know, of course, that there is some initial barrier for using BenchExec because you need to learn several things. And I am always glad about feedback and hints about how we could improve this by making it easier or having more documentation.
But I think that pushing people and especially beginners who know little about benchmarking into the direction of using less of BenchExec and more of their own hand-written scripts or manual steps is going in the wrong direction.
I think the target audience you have in mind is different from what I think of :-) A notable fraction of people in my area barely know one programming language and "writing an XML file" or "write a python class implementing an interface" is a significant task. The implementations often aren't stable tools but prototypes.
But this group of people would also struggle with having to write their own benchmarking scripts. So precisely for this group of people I would argue that using benchexec to take care of a lot of things with regard to benchmarking is important and useful.
Just for comparison: Using benchexec will need a tool-info module (which requires knowing basic Python + how invocations work + reading up / testing how things are passed around) + an XML file for the definitions (which is not really a nice experience due to XML) + understanding the terminology. This is a lot of effort for someone not well versed in programming. I would bet that the tool-info alone is enough to scare away most of the people I would want to convince of trying benchexec.
Even if they compare it against the effort of writing their own script that takes care of collecting all the benchmark runs and storing the results in a machine-readable way?
My main point is that currently there is no real reward for using BenchExec except "doing things right", which I believe is not enough to justify spending more than an hour on this for most PhD students, or, in any case, something that quite a few supervisors won't appreciate.
If you just replace time with runexec, then there are fields where there is no real reward for using BenchExec except "doing things right" (because reviewers do not care about data quality, reproducibility, etc.), I agree. But if you do not use runexec but benchexec, you do get a lot of immediate benefits such as not having to write scripts (or at least fewer scripts), nice tables for browsing through the results, the possibility to spontaneously create a scatter plot during a meeting with your supervisor with just a few clicks, etc.
So no, I do not want to force people to use benchexec and I do not want to hide runexec (and I believe we are not doing so), but I do want to recommend benchexec for most users instead of runexec.
(Bluntly: Using benchexec over time won't have any influence on the score of any paper in my community.) As such, I would find it nice if the README evokes the feeling of "if you want to measure properly, this really is not much effort and you don't have to rethink how you do evaluations at all, because there is a drop-in replacement, so no reason not to try it" + "if you want more stuff, there is more".
I see the point for those users who already have such scripts and I agree that for those users your suggestion is a good selling strategy. Thank you! I think we can incorporate this without pushing everyone to runexec first.
In order to lock in place what we have already and allow easier iteration, I have committed an attempt at this together with what was discussed before in 48cfd9bae81f0df719b9119adc5d33e121381c04. It would be nice if you could have a look whether you like it or have suggestions, and whether you agree with being a co-author. Then I would push this to the main branch.
Afterwards, I would be glad to hear what you think is still missing between that commit and this PR and what the goals of the remaining changes are.
Sorry for the long delay. It was a busy time.
Can relate, no problem :)
Would you like to be credited as a co-author of the commit?
No need, maybe when I contribute a larger chunk. But I also won't object :)
But this group of people would also struggle with having to write their own benchmarking scripts.
But they often are not aware that their scripts are a problem ;)
So I would see it as a compromise - make it easy to use the basic bits so at least that is done right. I agree that many more people should be using benchexec, which is exactly why I think there should be as little of a barrier to do the first steps as possible.
Even if they compare it against the effort of writing their own script that takes care of collecting all the benchmark runs and storing the results in a machine-readable way?
If only that were the case. Examples from a recent artifact evaluation: 1) needing to manually modify the source code to change the model being run, 2) an IDE packaged inside a VM where you needed to change the run parameters inside the VM (i.e., no usable command-line binary). Yes, these are extreme examples, but this is the sort of context that I want to make proper benchmarking appealing to. Most artifacts do not store results in a machine-readable way.
(because reviewers do not care about data quality, reproducibility, etc.),
But I care about this! :) I see where we are going. My point here is: I want to be able to establish using runexec or similar as a minimal baseline for proper benchmarks. I regularly criticize papers because they are lacking basic documentation on how they run their experiments. And if there is a clean and simple "here is how to replace /usr/bin/time with something reasonable" README, then I can point to that. Yes, benchexec is a benefit for the authors, but especially runexec is a benefit for reviewers and the community because of consistency.
Again, I agree that benchexec often is better than runexec, but it's easier to get started with runexec as a drop-in replacement for measurements (especially because it actually also does resource limiting!).
I see the point for those users who already have such scripts and I agree that for those users your suggestion is a good selling strategy. Thank you! I think we can incorporate this without pushing everyone to runexec first.
Great! I like the changes in the commit. I think the only addition I would make is that runexec also takes care of CPU and memory limits (compared to time, which only measures).
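For illustration, the limiting aspect might be shown with something like the following sketch (the limit values are placeholders, and the exact option syntax should be checked against doc/runexec.md, since it has varied between versions):

```sh
# unlike time, runexec can also enforce limits on the run:
# here, kill the tool after 900 s of CPU time or 8 GB of memory
runexec --timelimit 900s --memlimit 8GB -- ./mytool input.file
```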
Great! I like the changes in the commit. I think the only addition I would make is that runexec also takes care of CPU and memory limits (compared to time, which only measures).
Thanks! Added and pushed to main.
My point here is: I want to be able to establish using runexec or similar as a minimal baseline for proper benchmarks.
Thank you very much! We highly appreciate this (and not because it pushes our tool in particular, but because of the general improvement to the state of how research is done).
I regularly criticize papers because they are lacking basic documentation on how they run their experiments. And if there is a clean and simple "here is how to replace /usr/bin/time with something reasonable" README, then I can point to that.
Oh, yes! This is a great suggestion and I definitely want to provide this now!
Do you think extending https://github.com/sosy-lab/benchexec/blob/main/doc/runexec.md would be a good place?
(and not because it pushes our tool in particular, but because of the general improvement to the state of how research is done).
Same here :)
Do you think extending https://github.com/sosy-lab/benchexec/blob/main/doc/runexec.md would be a good place?
I am unsure. There are two separate points: 1. "this is what the tool can do" and 2. "this is how to do basic proper benchmarks". I think runexec.md should be more of a "documentation" of the tool, while the other should be a self-contained set of "instructions". I imagine something like this:
# Proper Measurements with Benchexec 101
## Guided Example
Establish small running example
### Step 1 Install benchexec
Run commands xy
-> Link to further details and troubleshooting
### Step 2 Consider basic setup
Important points: Time limit, resource limits, how to handle timeouts
-> Link to further information on benchmark design
### Step 3 Run tool with runexec
Concrete invocation and example output
-> Link to runexec doc
## Further resources
Integrate with your benchmarking script -> link to API
Automate executions and result gathering -> link to benchexec
Combining with docker -> link to notes on docker
etc.
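To make Step 1 of the sketch a bit more tangible, it could show something like the following (this assumes a pip-based installation is acceptable; BenchExec also provides distribution packages, and the cgroup check is based on the installation documentation and may differ by version):

```sh
# Step 1: install BenchExec for the current user
pip3 install --user benchexec

# optional sanity check whether the kernel/cgroup setup allows
# reliable measurements (see the installation documentation)
python3 -m benchexec.check_cgroups
```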
Basically, I would like to be able to write on the webpage of an artifact evaluation "We strongly encourage authors to follow established best practices for benchmarking [cite to your paper]. See here [link to the above page] for a guide on how to do it." (or something like this) and similarly just paste this in a review. (I think AE is probably the best way to raise awareness as the reviewers can actually inspect the methodology.) Point being, I think it would be nice if this is a "landing page" that can be directly linked to, if you see what I mean?
I see, and I fully agree that this would be great to have. We could call it "Quick start guide" or "Runexec Tutorial" or so? It should then probably be its own page in doc/.
Btw., also have a look at https://github.com/sosy-lab/benchexec/blob/main/doc/benchmarking.md. It also covers "stuff that you should do when benchmarking" and should be tightly coupled via links to what you are proposing, but probably keeping this checklist and the technical quick-start guide on separate pages is better.
I like "quickstart.md" a lot (its not exclusive about runexec
but rather concrete steps to improve benchmarking)
Yes, I saw benchmarking.md. I think it would be a good idea to condense the most important points / a set of "commandments" into the quickstart and then link to that with something like "for further information, see here".
I could sketch a quickstart if you want (but it will need some polishing / iteration I think)
I could sketch a quickstart if you want (but it will need some polishing / iteration I think)
I would be glad about this! Thank you a lot for your invaluable help!
Will do soon (tm); closing this MR then