benchexec icon indicating copy to clipboard operation
benchexec copied to clipboard

More flexible result handling for other properties, e.g., quantitative properties

Open PhilippWendler opened this issue 1 year ago • 3 comments

Factored out from #994:

In use cases like QComp the existing ways to define an expected verdict and classify the actual result into correct/wrong are not enough, because for example the expected verdict consists of a numeric value and an allowed error interval. To support this in BenchExec we need at least

  • [ ] a way to define different kinds of expected verdicts in the task-definition format
  • [ ] make [get_result_category] depend on the used property, e.g., by moving it into the Property class as a method

Then it also makes sense to create subclasses of Property as a way to group this nicely, e.g., have an SvcompProperty etc.

Likely we can keep the current assumption that a result category is one of correct, wrong, unknown, or error.

PhilippWendler avatar Feb 20 '24 14:02 PhilippWendler

Related comment: When an expected_verdict can be anything, does it make sense to write expected_verdict: "False(xy)" instead of adding subproperty: xy to the task-definition? (reasoning: subproperty is domain-specific)

Also: Right now, <propertyfile> can be filtered by expectedverdict. Should the expected verdict then be always a string / the value of that tag compared to the string representation of the expected_verdict? This would effectively limit to one-dimensional results.

incaseoftrouble avatar Feb 26 '24 19:02 incaseoftrouble

Related comment: When an expected_verdict can be anything, does it make sense to write expected_verdict: "False(xy)" instead of adding subproperty: xy to the task-definition? (reasoning: subproperty is domain-specific)

Well, let me quote you from #994:

Definitely in favor of additional keys (in general, I am against magic inside strings :) ). For example, we also would want to specify relative or absolute error, or might consider statistical methods (which take two parameters). Just seems more extensible this way :)

Of course in BenchExec we already have precedent of false(xy) as actual result. So we could consider this. But if we put something like the precision into a separate key, then we should make it consistently always in separate keys.

Also: Right now, <propertyfile> can be filtered by expectedverdict. Should the expected verdict then be always a string / the value of that tag compared to the string representation of the expected_verdict? This would effectively limit to one-dimensional results.

Hm, what use cases would there even be for filtering numeric expected verdicts? I guess we can leave this out in beginning?

PhilippWendler avatar Feb 27 '24 08:02 PhilippWendler

But if we put something like the precision into a separate key, then we should make it consistently always in separate keys.

But that would be part of the query, not the expected verdict, right? So, for (single dimensional) quantitative properties, I think the expected_verdict would always be float.

Definitely in favor of additional keys (in general, I am against magic inside strings :) )

That was under the (wrong) assumption that precision would go into the verdict (which doesn't make sense after thinking about it). I like False(xy) more than a "sibling" key to expected verdict. Allowing for domain-specific magic is IMO ok for conciseness.

Hm, what use cases would there even be for filtering numeric expected verdicts?

Not much, I think. On that note, I think that the expectedverdict filter is a bit ad-hoc in this regard. (a special case of (derived) tags :) )

incaseoftrouble avatar Feb 27 '24 10:02 incaseoftrouble