
Score schema for the SoftwareSystems category

Open mikhailramalho opened this issue 8 years ago • 8 comments

Hi all,

I'm running some tests now and just saw the distribution for the SoftwareSystems category:

- Systems_BusyBox_MemSafety: 52 benchmarks
- Systems_BusyBox_Overflows: 52 benchmarks
- Systems_DeviceDriversLinux64_ReachSafety: 2796 benchmarks

Given our meta category score calculation, this is pretty uneven. For comparison:

|                               | SoftwareSystems | Overall |
|-------------------------------|----------------:|--------:|
| 1 correct true (busybox)      |              37 |      19 |
| 1 correct true (DevDriver)    |               1 |       1 |
| 1 correct false (busybox)     |              19 |      10 |
| 1 correct false (DevDriver)   |               0 |       0 |
| 1 incorrect true (busybox)    |            -595 |    -306 |
| 1 incorrect true (DevDriver)  |             -11 |      -6 |
| 1 incorrect false (busybox)   |            -297 |    -153 |
| 1 incorrect false (DevDriver) |              -6 |      -3 |

Could someone please review my calculation and check whether it is correct? If it is, we need to do something about it: the 52 BusyBox benchmarks should not carry the same weight as almost 3000 device-driver benchmarks.

mikhailramalho avatar Dec 13 '16 14:12 mikhailramalho

The weighting schema in SV-COMP is designed to weight all sub-categories equally.

This means, verifying the property MemSafety is as important as verifying Overflows or ReachSafety. In other words, a verifier that can only do MemSafety is not worse than a verifier that can only do ReachSafety. Sounds reasonable to me.

The other argument was: a few verification tasks of one kind should not carry less weight than many verification tasks of another kind. Suppose I have a verifier that is good in sub-category X; then I could submit hundreds of verification tasks to make that sub-category bigger and my chances of winning much higher.

Suppose we made the ReachSafety tasks count more: the result would be that verifiers that are good at checking ReachSafety would have an advantage. But SV-COMP's position is: implementing a feature to support ReachSafety is worth no more than implementing a feature to support MemSafety.

Re the example: the calculation above is incorrect because the normalization is missing. But the conclusion still holds: per-task scores in small sub-categories are larger than in large sub-categories.
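
For concreteness, here is a minimal sketch of the meta-category score with the normalization applied. It assumes the SV-COMP 2017 raw scores (+2 correct true, +1 correct false, -32 incorrect true, -16 incorrect false) and the rule of normalizing each sub-category score by its task count, then scaling the sum by the average sub-category size; the function and names are illustrative, not competition code:

```python
# Sketch of the SV-COMP meta-category normalization. Assumed raw
# scores: +2 correct true, +1 correct false, -32 incorrect true,
# -16 incorrect false. Sub-category sizes from the numbers above.
SUBCATEGORY_SIZES = {
    "Systems_BusyBox_MemSafety": 52,
    "Systems_BusyBox_Overflows": 52,
    "Systems_DeviceDriversLinux64_ReachSafety": 2796,
}

def meta_score(raw_scores: dict, sizes: dict) -> float:
    """Normalize each sub-category score by its task count, then
    scale the sum by the average sub-category size."""
    avg_size = sum(sizes.values()) / len(sizes)
    normalized = sum(raw_scores.get(cat, 0) / n for cat, n in sizes.items())
    return normalized * avg_size

# One correct 'true' (+2) in BusyBox MemSafety:
print(round(meta_score({"Systems_BusyBox_MemSafety": 2}, SUBCATEGORY_SIZES)))
# -> 37 (cf. the SoftwareSystems column above)

# One correct 'true' (+2) among the device drivers:
print(round(meta_score(
    {"Systems_DeviceDriversLinux64_ReachSafety": 2}, SUBCATEGORY_SIZES)))
# -> 1
```

Under this rule, a single task in a sub-category with n_i tasks carries the weight avg_size / n_i, which is why small sub-categories count more per task.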


dbeyer avatar Dec 13 '16 15:12 dbeyer

I agree with @mikhailramalho. I think the problem is not in the scoring schema but in the sub-category arrangement. Tools that check memory safety, overflows, and reachability are not competitors on the software market; they complement each other, so they should be evaluated separately. I suggest moving:

a) Systems_BusyBox_MemSafety -> 2. MemorySafety
b) Systems_BusyBox_Overflows -> 4. Overflows

mutilin avatar Dec 14 '16 10:12 mutilin

You forgot: c) Systems_DeviceDrivers_ReachSafety -> 1. ReachSafety

Yes, this would be consistent, i.e., having only the properties as the top-level classification.

But: The purpose of the software-systems category was to put an emphasis on real software, i.e., apply verifiers to industrial code. This includes checking several properties. Or do you consider MemorySafety and Overflows less important for industrial software?

Do you really think that the ReachSafety property alone is so important that it should get a second category of its own?

dbeyer avatar Dec 14 '16 11:12 dbeyer

Of course MemorySafety and Overflows are equally important for industrial software (MemSafety errors are frequently the basis for vulnerabilities). In the long run, I would like to see three categories: Systems_MemSafety, Systems_Overflows, and Systems_ReachSafety. For now, I think Systems_BusyBox_MemSafety and Systems_BusyBox_Overflows need more benchmarks before they can become separate categories.

mutilin avatar Dec 14 '16 12:12 mutilin

Wait, I am lost here. Are a few BusyBox benchmarks going to be worth the same as thousands of device-driver benchmarks? Is that what will happen if the categories remain the way they are now?

zvonimir avatar Dec 14 '16 22:12 zvonimir

@mutilin: Yes, I agree. But nobody has added more such programs. @zvonimir: What is it worth to check error reachability for thousands of programs if you do not find the security vulnerability in one MemorySafety program? We have this effect in every category: in categories with few tasks, each single verification task counts more. The goal is for all sub-categories to have the same value.
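
To quantify the effect, a back-of-the-envelope check using the sub-category sizes from the top of the thread (the ratio reduces to 2796/52):

```python
# Per-task weight under the normalization is avg_size / n_i.
avg_size = (52 + 52 + 2796) / 3                     # ~966.7
busybox_weight = avg_size / 52                      # ~18.6 per task
devdriver_weight = avg_size / 2796                  # ~0.35 per task
print(round(busybox_weight / devdriver_weight, 1))  # -> 53.8
```

So under the current counts, one BusyBox task carries roughly the weight of 54 device-driver tasks.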

dbeyer avatar Dec 14 '16 22:12 dbeyer

In this category the effect is extremely pronounced. A reachability error can easily be a security vulnerability (depending on what the assertion encodes), so that is a flawed argument for how "important" categories are. I just think it makes no sense to weight categories equally when one contains 50 benchmarks and the other 5000. There is no way that one BusyBox benchmark is worth 100 device-driver benchmarks, especially given that the BusyBox benchmarks are really not that special. In fact, I would say they are inferior to the device-driver benchmarks in quality.

zvonimir avatar Dec 14 '16 22:12 zvonimir

Anyhow, enough said. Given that we decided to freeze the categories, maybe we should not make any changes at this point.

zvonimir avatar Dec 14 '16 22:12 zvonimir