
Feature: support multiproperty verification

Open mutilin opened this issue 8 years ago • 22 comments

This year we have added a new demo category for SV-COMP'2017 called c/ldv-multiproperty. Each task in the category contains several properties to check. The properties are listed in the corresponding property file (c/ldv-multiproperty/ALL.prp):

    CHECK( init(main()), LTL(G ! call(__VERIFIER_error_<property>())) )
    ...

For each task we need to get a verification result. The problem is that we need a verification result not only for the whole task, but for each property separately. Currently BenchExec provides only one verdict (status) per verification task. It looks like we need an extension of BenchExec to support the following:

  1. Get status for each property separately.
  2. Set correspondence between witness file and violated property with status=unsafe.
  3. Specify ideal verdicts for the properties.

For items 2 and 3 it seems that we have a solution. For item 2 we can extract the property from the violated_property tag in the XML witness file. For item 3 we have decided to add a file (<task>.verdict) for each verification task in the following format: <property>:<verdict>. See examples in c/ldv-multiproperty.

For the first item we have an idea that the verdict should be represented in the benchmark result XML file, like

    <column title="status" property="<property1>" value="<status1>" />
    …
    <column title="status" property="<propertyN>" value="<statusN>" />

In BenchExec it may be implemented as a new interface function, like determine_multiproperty_result, which returns a map from property to status.
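As a minimal sketch of what such an interface function could look like: the class and method names, and the status-matching logic, are assumptions for illustration only, not part of BenchExec's actual API.

```python
# Hypothetical sketch of the suggested multiproperty interface for a
# tool-info module. A real implementation would inspect the tool's actual
# output format; this stub just scans for per-property markers.

class MultipropertyTool:
    def determine_multiproperty_result(self, returncode, output, properties):
        """Return a map from property name to status string."""
        results = {}
        for prop in properties:
            if "FALSE({})".format(prop) in output:
                results[prop] = "false"
            elif "TRUE({})".format(prop) in output:
                results[prop] = "true"
            else:
                results[prop] = "unknown"
        return results
```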

mutilin avatar Jan 12 '17 07:01 mutilin

  2. Set correspondence between witness file and violated property with status=unsafe.

For items 2 and 3 it seems that we have a solution. For item 2 we can extract the property from the violated_property tag in the XML witness file.

What exactly do you mean by this? Does BenchExec need to handle witness files for multiproperty verification?

Currently, BenchExec itself does not know about witness files, only Dirk's additional scripts for witness validation do. Of course these scripts need to be updated, but I can't say anything about them.

  3. Specify ideal verdicts for the properties.

For 3 we have decided to add a file (<task>.verdict) for each verification task in the following format:

    <property>:<verdict>

See examples in c/ldv-multiproperty.
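To make the proposed format concrete, here is a minimal parsing sketch; the function name is hypothetical, and the one-entry-per-line layout is the only assumption taken from the proposal above.

```python
# Sketch of parsing the proposed <task>.verdict format, where each line is
# "<property>:<verdict>". This is an illustration of the proposal in this
# thread, not an implemented BenchExec feature.

def parse_verdict_file(text):
    """Parse lines of the form '<property>:<verdict>' into a dict."""
    verdicts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        prop, _, verdict = line.partition(":")
        verdicts[prop.strip()] = verdict.strip()
    return verdicts
```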

Wouldn't it be a good idea to use a standard language for the format of this file, such as YAML?

There is also a more general discussion of how to store metadata about verification tasks in sosy-lab/sv-benchmarks#213. If this is actually implemented in a way that adds a metadata file for each verification task, this should of course also contain the verdicts you are proposing.

So maybe keep the format extensible for the future? Or maybe now is a good time to discuss sosy-lab/sv-benchmarks#213?

  1. Get status for each property separately.

For the first item we have an idea that the verdict should be represented in the benchmark result XML file, like

    <column title="status" property="<property1>" value="<status1>" />
    …
    <column title="status" property="<propertyN>" value="<statusN>" />

This would be a possibility, or alternatively

    <column title="status (<property1>)" value="<status1>"/>

I suspect the latter would need fewer changes to e.g. table-generator.

What would the value of <property1> be in reality? Can you give an example? How is the mapping from the lines in the property file to the <property1> string done?

In BenchExec it may be implemented as a new interface function, like determine_multiproperty_result, which returns a map from property to status.

Maybe a method determine_result_for_property that gets called multiple times, once for each property, would be more flexible and easier to implement in the tool-info modules? Then the case of single-property verification would just be a special case of multiproperty verification but doesn't need to be handled separately.
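A sketch of this alternative, per-property interface; the method name determine_result_for_property comes from the suggestion above, while the toy tool class and its output-matching logic are assumptions for illustration.

```python
# Sketch of the per-property interface: BenchExec would call the tool-info
# method once for each property. ExampleTool is a toy stand-in for a real
# tool-info module.

class ExampleTool:
    def determine_result_for_property(self, output, prop):
        """Return the status for a single property (hypothetical signature)."""
        if "violation of {}".format(prop) in output:
            return "false"
        return "true"

def collect_results(tool, output, properties):
    # Single-property verification is just the special case of a
    # one-element property list, so it needs no separate handling.
    return {p: tool.determine_result_for_property(output, p) for p in properties}
```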

How do you want to handle error results like crashes? Should each property then be set to "error" individually? Or do you want an additional main status, and set the property status columns only when main status is not an error?

PhilippWendler avatar Jan 12 '17 08:01 PhilippWendler

What exactly do you mean by this?

I mean that for each propertyK we need to know the path to the witness file.

Does BenchExec need to handle witness files for multiproperty verification?

Currently, BenchExec itself does not know about witness files, only Dirk's additional scripts for witness validation do. Of course these scripts need to be updated, but I can't say anything about them.

Yes, the current solution I have in mind is outside BenchExec, but for LDV tools we need the same functionality as Dirk will use in his additional scripts. Thus, maybe we need to have a library to handle witnesses.
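One function such a library could offer is extracting the violated property from a witness. The element name violated_property is taken from this discussion; the actual SV-COMP witness format (GraphML) may structure this differently, so this is a sketch under that assumption.

```python
import xml.etree.ElementTree as ET

# Sketch of a witness-library helper: find the violated property in an XML
# witness. Assumes a violated_property element as mentioned in this thread.

def violated_property(witness_xml):
    """Return the text of the first violated_property element, or None."""
    root = ET.fromstring(witness_xml)
    elem = root.find(".//violated_property")
    if elem is not None and elem.text:
        return elem.text.strip()
    return None
```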

Wouldn't it be a good idea to use a standard language for the format of this file, such as YAML?

I think it is a good idea.

There is also a more general discussion of how to store metadata about verification tasks in sosy-lab/sv-benchmarks#213. If this is actually implemented in a way that adds a metadata file for each verification task, this should of course also contain the verdicts you are proposing.

So maybe keep the format extensible for the future? Or maybe now is a good time to discuss sosy-lab/sv-benchmarks#213?

For multiproperty verification we just need verdicts, but I cannot estimate how useful the other information mentioned in sosy-lab/sv-benchmarks#213 is. So I prefer not to have a time dependency on this task.

I suspect the latter would need fewer changes to e.g. table-generator.

Looks OK to me

What would the value of <property1> be in reality? Can you give an example? How is the mapping from the lines in the property file to the <property1> string done?

I suppose that it should be the name of the unreach function mentioned in the property, like __VERIFIER_error_linux_usb_urb
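Deriving that name from a property-file line could look like the following sketch; the function name and the exact regular expression are assumptions matching the property format quoted at the top of this issue.

```python
import re

# Sketch of mapping a property-file line to a property name, under the
# assumption from this comment that the name is the unreach function
# mentioned in the CHECK(...) clause.

PROPERTY_PATTERN = re.compile(
    r"CHECK\(\s*init\(main\(\)\),\s*LTL\(G\s*!\s*call\((\w+)\(\)\)\)\s*\)"
)

def property_name(line):
    """Return the error function name from a reachability property line."""
    match = PROPERTY_PATTERN.search(line)
    return match.group(1) if match else None
```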

Maybe a method determine_result_for_property that gets called multiple times, once for each property, would be more flexible and easier to implement in the tool-info modules?

I think that it is flexible for tool-info developers, but maybe not as fast as a method which determines all verdicts at once. It requires that BenchExec knows the format of the input property file, but at the same time tool-info developers will not have to parse it themselves.

How do you want to handle error results like crashes? Should each property then be set to "error" individually? Or do you want an additional main status, and set the property status columns only when main status is not an error?

I think it is OK to have an "error" for each property individually. It is not clear how to set a main status for all types of errors/crashes, because some errors/crashes occur only for specific properties.

mutilin avatar Jan 12 '17 14:01 mutilin

Currently, BenchExec itself does not know about witness files, only Dirk's additional scripts for witness validation do. Of course these scripts need to be updated, but I can't say anything about them.

Yes, the current solution I have in mind is outside BenchExec, but for LDV tools we need the same functionality as Dirk will use in his additional scripts. Thus, maybe we need to have a library to handle witnesses.

Such a library could of course be useful, and could either be stored in the contrib folder in the BenchExec repository, or maybe even better in sosy-lab/sv-witnesses. I suggest talking to @dbeyer and @mdangl about this.

There is also a more general discussion of how to store metadata about verification tasks in sosy-lab/sv-benchmarks#213. If this is actually implemented in a way that adds a metadata file for each verification task, this should of course also contain the verdicts you are proposing.

So maybe keep the format extensible for the future? Or maybe now is a good time to discuss sosy-lab/sv-benchmarks#213?

For multiproperty verification we just need verdicts, but I cannot estimate how useful the other information mentioned in #213 is. So I prefer not to have a time dependency on this task.

I can understand that. Do you think there could be a format that you can design now and extend later? Introducing one format now and then maybe replacing it with a different one a few months later would be cumbersome as well.

What would the value of <property1> be in reality? Can you give an example? How is the mapping from the lines in the property file to the <property1> string done?

I suppose that it should be the name of the unreach function mentioned in the property, like __VERIFIER_error_linux_usb_urb

Who would implement this mapping? BenchExec?

How do we support cases where other properties not based on reachability of an error function are also present? Or don't we need to support them?

That question leads me to a more general question: currently there is already a case where property files contain multiple properties (for memsafety). This situation has different semantics and expectations for tools than the use case you want. So how do we distinguish between them? How would tools and BenchExec know that in one case they should just find a violation of any property, and in the other case they should try to find a violation of every property?

Maybe a method determine_result_for_property that gets called multiple times, once for each property, would be more flexible and easier to implement in the tool-info modules?

I think that it is flexible for tool-info developers, but maybe not as fast as a method which determines all verdicts at once.

I think this should not be a performance problem, should it? Tools really should not output megabytes of text to stdout; after all, this is something that is intended to be looked at by humans, and humans typically can't read megabytes of log output.

It requires that BenchExec knows the format of the input property file, but at the same time tool-info developers will not have to parse it themselves.

I would certainly not let the tool-info developers parse the property file anyway. Apart from the fact that this is error-prone, it is also simply highly redundant and not future-proof for changes of the format. BenchExec needs to parse the file in all cases and tell the tool-info module which properties it expects.

PhilippWendler avatar Jan 12 '17 18:01 PhilippWendler

Such a library could of course be useful, and could either be stored in the contrib folder in the BenchExec repository, or maybe even better in sosy-lab/sv-witnesses. I suggest talking to @dbeyer and @mdangl about this.

I asked @vmordan to think about the interface we need for the library, but of course we do not know exactly what the needs of SV-COMP are, and this should be discussed with @dbeyer and @mdangl.

I can understand that. Do you think there could be a format that you can design now and extend later? Introducing one format now and then maybe replacing it with a different one a few months later would be cumbersome as well.

I think we can provide a format which can be part of the sosy-lab/sv-benchmarks#213 proposal concerning the specification of ideal verdicts. That way I think we can easily integrate the full functionality of sosy-lab/sv-benchmarks#213 once it is ready.

Who would implement this mapping? BenchExec?

I think that the mapping should be defined in the SV-COMP rules, the same as for memsafety where, for example, CHECK( init(main()), LTL(G valid-free) ) is mapped to valid-free.

How do we support cases where other properties not based on reachability of an error function are also present? Or don't we need to support them?

To support a mix of properties I suggest adding the prefix unreach-call to the error function name. For example, unreach-call,__VERIFIER_error_linux_usb_urb.
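A tiny sketch of this prefixing scheme; the set of known property names below is illustrative only, and the function name is hypothetical.

```python
# Sketch of the suggested identifier scheme for mixed property files:
# well-known property names keep their own identifier, while reachability
# properties get an "unreach-call," prefix before the error function name.

KNOWN_PROPERTIES = {"valid-free", "valid-deref", "valid-memtrack"}  # illustrative

def property_identifier(name):
    """Build the prefixed identifier for a property name."""
    if name in KNOWN_PROPERTIES:
        return name
    return "unreach-call," + name
```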

I also think that instead of <column title="status (<property1>)" value="<status1>"/> we could even report the status for a property as a parameter of the verdict, as it is done now, like <column title="status" value="false (property1)"/>, for example <column title="status" value="false (unreach-call,__VERIFIER_error_linux_usb_urb)"/>. We could even add the prefix memsafety to memsafety properties for a better look: memsafety,valid-free, memsafety,valid-deref.

So how do we distinguish between them? How would tools and BenchExec know that in one case they should just find a violation of any property, and in the other case they should try to find a violation of every property?

I agree we need to distinguish checking for the first violated property from checking for all properties, where we do not stop after the first violation but produce results for each property. I see two ways:

  1. Let BenchExec know it, for example, by using a new tag <multiproperty></multiproperty> instead of <property>.
  2. BenchExec may not know about the way the tool checks the multiproperty file at all. But this requires an interface with tool-info that does not depend on the properties. For example, BenchExec may provide a set of properties to determine_result and receive a set of verdicts:
  • if the tool checks for a violation of any property, it returns one verdict;
  • otherwise, it returns a set of verdicts for each property separately.

The second way clearly contradicts the suggestion of using determine_result_for_property.

I think this should not be a performance problem, should it?

My hypothesis is the same

mutilin avatar Jan 13 '17 10:01 mutilin

I asked @vmordan to think about the interface we need for the library, but of course we do not know exactly what the needs of SV-COMP are, and this should be discussed with @dbeyer and @mdangl.

This library should provide the following functionality:

  • determine the violated property from a given witness;
  • read a configuration YAML file (see the suggestion in https://github.com/sosy-lab/sv-benchmarks/issues/213) with correct results;
  • provide a scoring schema for evaluating results.

We do not know how exactly the scoring schema (including witness validation) is implemented, therefore this point requires further discussion.

vmordan avatar Jan 13 '17 12:01 vmordan

How do we support cases where other properties not based on reachability of an error function are also present? Or don't we need to support them?

To support a mix of properties I suggest adding the prefix unreach-call to the error function name. For example, unreach-call,__VERIFIER_error_linux_usb_urb.

I also think that instead of <column title="status (<property1>)" value="<status1>"/> we could even report the status for a property as a parameter of the verdict, as it is done now, like <column title="status" value="false (property1)"/>, for example <column title="status" value="false (unreach-call,__VERIFIER_error_linux_usb_urb)"/>

This would mean we have multiple columns with the same title in the result. This would make it hard to generate meaningful tables (for example, how would we correlate the results for the same property across various runs?). The property should be part of the title, or added as an additional attribute.

So how do we distinguish between them? How would tools and BenchExec know that in one case they should just find a violation of any property, and in the other case they should try to find a violation of every property?

I agree we need to distinguish checking for the first violated property from checking for all properties, where we do not stop after the first violation but produce results for each property. I see two ways:

  1. Let BenchExec know it, for example, by using a new tag <multiproperty></multiproperty> instead of <property>.

Or possibly an attribute to <propertyfile>.

  2. BenchExec may not know about the way the tool checks the multiproperty file at all. But this requires an interface with tool-info that does not depend on the properties. For example, BenchExec may provide a set of properties to determine_result and receive a set of verdicts:
  • if the tool checks for a violation of any property, it returns one verdict;
  • otherwise, it returns a set of verdicts for each property separately.

The second way clearly contradicts the suggestion of using determine_result_for_property.

I would prefer to not let the tool-info module decide this. After all, it is a fact about the verification task (the user decides that she wants to check several properties in a certain way), not something that comes from the concrete run or the tool. If the tool checks the properties in a different way from what the user expects, BenchExec needs to follow the user's way, not the tool's way.

PhilippWendler avatar Jan 17 '17 07:01 PhilippWendler

This would mean we have multiple columns with the same title in the result. This would make it hard to generate meaningful tables (for example, how would we correlate the results for the same property across various runs?). The property should be part of the title, or added as an additional attribute.

I agree, having different titles is better.

Or possibly an attribute to <propertyfile>.

Also nice

I would prefer to not let the tool-info module decide this. After all, it is a fact about the verification task (the user decides that she wants to check several properties in a certain way), not something that comes from the concrete run or the tool. If the tool checks the properties in a different way from what the user expects, BenchExec needs to follow the user's way, not the tool's way.

OK

mutilin avatar Feb 01 '17 08:02 mutilin

How could we proceed with the enhancement?

mutilin avatar Feb 01 '17 08:02 mutilin

Best would be if you could create a pull request that contains everything for this feature: code, documentation, etc. Then I would be glad to review it.

PhilippWendler avatar Feb 01 '17 12:02 PhilippWendler

This enhancement was implemented in pull request https://github.com/sosy-lab/benchexec/pull/226.

Here is a detailed description of the changes:

  1. doc/multiproperty.md - added a new documentation page describing this enhancement. The text of some other documentation pages was also updated, and a link to this page was added.
  2. template.py - added method determine_result_for_property, which is aimed at getting the status of a selected property. In order to use multi-property verification, this method needs to be overridden in the corresponding tool wrapper.
  3. cpachecker.py - overrode method determine_result_for_property. Currently supports CPAchecker branch cmav-2, revision 24399, configuration -ldv-mav.
  4. model.py - added support for the tag <multiproperty>. If this flag is set, the method determine_result_for_property will be called for each given property.
  5. outputhandler.py - added handling of multi-property results to put them into the XML file and to compute a score.
  6. result.py:
    • added support for properties with any error function name: LTL(G ! call(<function_name>()));
    • changed function properties_of_file to support several properties in the property file;
    • added partial support for YAML configuration files in accordance with https://github.com/sosy-lab/sv-benchmarks/issues/213 in order to get expected results for tasks (function _get_yaml_ideal_statuses);
    • added support for score computation for multi-property verification (function score_for_task);
    • added support for determining whether the verdict for a task is correct or not (function get_result_category).

Here is an example from https://github.com/sosy-lab/sv-benchmarks/tree/master/c/ldv-multiproperty: multiproperty_example.zip.

Currently we have several questions about the implementation:

  1. Do we need a single (overall) status in case of multi-property verification for backward compatibility? As I understand, this status is used frequently (in the results.txt file, for table generation, etc.), where it is hard to present many properties at once. In case of global errors (such as a parsing failure) it is also clearer to present the reason for the error only once. At the same time, this status does not show which property was violated and which held.
  2. Do you agree with the suggested scoring schema, in which the status for each property is evaluated separately?
  3. The CPAchecker wrapper currently determines the property name in the following way: it is either equal to one of the predefined values 'valid-deref', 'valid-free', 'valid-memtrack', 'no-overflow', 'no-deadlock', 'termination', or it is 'unreach-call' otherwise. Therefore, the output Verification result: FALSE. Property violation found by chosen configuration. is interpreted as false(unreach-call). Thus, in the current implementation unreach-call is used as the CPAchecker property name by default (unless the name is equal to one of the predefined values). Is this correct in the general case?
  4. Currently only the relevant part of https://github.com/sosy-lab/sv-benchmarks/issues/213 was implemented, which provides correct results. Also, the Python library pyyaml is required to use the YAML format. Should this feature remain outside of https://github.com/sosy-lab/sv-benchmarks/issues/213, which is more general and does not intersect much with multi-property verification?

vmordan avatar Feb 17 '17 14:02 vmordan

I added my comments to the pull request.

For question 1, I am not sure. It depends a lot on what is represented in the main status in case of multi-property verification, and whether this is useful for users. Because I have no experience with multi-property verification, I can't answer this.

For question 2, this should be decided by SV-Comp. BenchExec just implements the SV-Comp scoring scheme.

For question 3, I don't really understand the question. Can't you give a precise output in CPAchecker?

For question 4, how would you implement multi-property verification if you keep the yaml feature out of BenchExec?

PhilippWendler avatar Feb 20 '17 16:02 PhilippWendler

Philipp, do I understand correctly that you suggest using YAML files as input for BenchExec instead of C ones? How can we represent this? As an attribute to existing tags, like <input kind="yaml">*.yaml</input>? Do we need to support the old way of specifying ideal verdicts inside the file name and passing a single C file as input?

mutilin avatar Feb 25 '17 10:02 mutilin

Yes, this is the current state of the proposal.

There has been only one alternative presented so far, which is to continue giving a C file to BenchExec and let BenchExec guess a file name for a yaml file and use it if it's there. I think this is clearly more error-prone.

Do you have any other alternatives?

So far, we need to continue supporting the old style of giving a C file with verdicts in the file name, yes. First, BenchExec should stay backwards-compatible, and second, this way might continue to be used for trivial cases (e.g., single C file, single property) because it halves the number of files lying around.

I do not see any problem with supporting both styles.

However, note that these questions are not specific to multi-property verification and should thus probably be discussed in sosy-lab/sv-benchmarks#213.

PhilippWendler avatar Feb 25 '17 12:02 PhilippWendler

OK, understood. I don't have any alternatives.

mutilin avatar Feb 25 '17 14:02 mutilin

Regarding sosy-lab/sv-benchmarks#213, do you think we should pass to BenchExec 1) a yaml file which contains all the properties from the proposal, and BenchExec should select the relative ones, or 2) a yaml file which contains only the relative properties, and all the others are considered an error?

mutilin avatar Feb 25 '17 14:02 mutilin

What do you mean by "relative" properties?

PhilippWendler avatar Feb 25 '17 15:02 PhilippWendler

I mean that some properties from sosy-lab/sv-benchmarks#213 may not be used by BenchExec, but still be present in the yaml file. For example, categories seems to me to be relevant not for BenchExec, but rather for the SV-COMP scripts.

mutilin avatar Feb 25 '17 16:02 mutilin

Regarding your example: categories is out of the current proposal AFAIK.

But anyway, there should be a schema and a versioning scheme for the yaml files coming out of sosy-lab/sv-benchmarks#213, and yes, BenchExec should accept any yaml file that has a matching version number and is correct according to the schema, even if there are keys in the file that are of no interest to BenchExec.
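The acceptance rule described here can be sketched as follows; the key names ("format_version", "verdicts") and the version string are assumptions for illustration, since the actual schema of sosy-lab/sv-benchmarks#213 was still under discussion.

```python
# Sketch of the acceptance rule: a metadata file is accepted if its format
# version matches, and keys that BenchExec does not care about (such as
# categories) are simply ignored. Key names are hypothetical.

SUPPORTED_VERSION = "1.0"

def read_verdicts(metadata):
    """Extract the verdicts from an already-parsed metadata dict."""
    if metadata.get("format_version") != SUPPORTED_VERSION:
        raise ValueError("unsupported metadata format version")
    return metadata.get("verdicts", {})
```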

PhilippWendler avatar Feb 25 '17 16:02 PhilippWendler

Here is an example of a table-generator result for multi-property verification (based on c/ldv-multiproperty tasks). Each task contains a status for each property, a global status, and a category; the score is computed in the same way as in BenchExec.

Do you have any suggestions about possible improvements of this format for multi-property verification?

vmordan avatar Mar 02 '17 15:03 vmordan

Looks nice. Whether it is useful in practice you probably know better than me :-)

What does orange true and false mean? Typically true/false are either green or red.

Maybe also compute a score for each property alone?

There is currently no way to see the expected results in the table. Not sure if this is necessary, but could be nice to have. One possibility would be to have the property column that table-generator adds if not all rows have the same property (check SV-Comp's Overall tables for example). Then the entries in this column could be a link to the yaml file. Another solution would be to have the first column link to the yaml file, and when the user clicks on this some JS parses the yaml and displays it nicely (including links to the input files).

PhilippWendler avatar Mar 02 '17 16:03 PhilippWendler

There are two cases. The first is a status for a property. Property statuses (such as status (unreach-call(__VERIFIER_error_linux_kernel_locking_spinlock))) true and false should be either green or red; I'm going to fix this. In this case they will also indicate their expected status, so the expected status will be missing only for unknown property statuses. Do you agree to use a pop-up tooltip to show the expected status for each cell?

The second is a status for the whole benchmark. It describes the global status in case of multi-property verification; it is useful when a global problem occurred (exception, parsing failure) or some properties were not verified because of a timeout or out of memory. In other cases it is not very informative.

I used colors to represent the correspondence of properties to expected results, but since we have many properties instead of one, there are many different situations. For example, if we have an incorrect result for some property, we mark the global status as red (incorrect). If some property got an unknown verdict, we mark it orange; otherwise green.

vmordan avatar Mar 03 '17 12:03 vmordan

A tooltip would certainly be a good idea. Maybe even implement this independently in a separate PR for all status columns? (Can be done afterwards, as well, but you could file an issue for it if you like the idea.)

Rest of your description also sounds fine.

PhilippWendler avatar Mar 03 '17 15:03 PhilippWendler