vnncomp2021
Rules Discussion
A preliminary document with the proposed competition rules is linked here.
Please use this issue to discuss any aspects or possible changes to the rules. Also please post if you think the rules are fine as is, just to let us know people have looked at them.
@stanleybak Thank you for organizing the competition and drafting the competition rules! I have a comment on the machine type used for the competition.
The GPU (Tesla T4) on the g4dn.8xlarge instance is too weak in my opinion. The benchmark shows that it is even slower than a $339 RTX 2060 released in 2019. The Tesla T4 is a low-profile GPU aimed at light workloads (e.g., inference), and its performance is far inferior to current-generation mainstream GPUs.
I think that to truly represent the performance of GPU-based verifiers in real-world scenarios we probably need a "normal" mainstream (not low-profile) GPU. I think a Tesla P100 is a fair choice (although the P100 is also three generations old and not ideal; a V100 or A100 may be too expensive to rent, and I understand that people have a limited budget). I guess one reason for the current choice is that other GPU instance types on AWS do not provide a sufficient number of CPUs (e.g., p3.2xlarge with 1x V100 only provides 8 CPUs and 61 GB memory). Maybe we can use Google Cloud Compute instead, which allows us to customize the machine type (e.g., we can create a custom type with 16 CPUs, 128 GB memory, and a P100 for around $1.7 per hour).
Hi All
@stanleybak @changliuliu - Thanks very much for organising VNN again - it is great to see another event!
I went through the rules - we will be happy to comment on the specifics. However there is something that I think may be problematic for a number of groups; it certainly is for us. It relates to the proposed rule that only one project per group may participate. On our side we have two largely independent efforts on NN verification. They are based on different methods and they are led by different people. We see VNN as a fantastic learning experience for participants and a friendly get together for all researchers in the area. As you would expect we'd love to see both teams benefit from the VNN experience and share the results and the code with the broader community after all the work that's gone into this.
Can we envisage a system whereby genuinely different tools can still participate even if they were developed in the same department? Alternatively, it may be that all that is required is that "group" is simply interpreted to mean the lead developers?
Thank you for considering this.
Best
-Alessio
@alessiolomuscio I would personally be okay to have multiple tools from one group, assuming they're sufficiently different (say, something like less than 50% shared code base). The intention with this rule was that there's some manual overhead and cost associated with evaluating each tool, and we didn't want a group to submit 20 versions of their same tool with slightly different parameters for evaluation.
@huanzhang12 A more powerful GPU is probably an option, within reason. It's a bit hard to achieve fairness with CPU-only tools. Maybe roughly equal cost per hour of cloud time is one way to offer a fair comparison? That way CPU-only tools could use a more powerful CPU and GPU tools would use their specialized hardware. Any thoughts on this?
@stanleybak I can see the rationale for this and I entirely agree it would not be good to have multiple versions of the same tool tuned in with different hyper-parameters. The solution you propose makes perfect sense to me.
@stanleybak Yes I think it makes sense to use "equal cost per hour of cloud time" to make a fair comparison.
I configured two Google cloud instances below with roughly the same cost. The GPU instance has 1x P100 GPU and 8 CPUs. The CPU instance has no GPU but 4x more CPUs and 4x memory size to match the cost of the GPU instance. What do you think?
GPU instance: 8 CPUs, 48 GB memory, 1x NVIDIA P100 GPU, $1.359 per hour
CPU instance: 32 CPUs (4x the instance above), 192 GB memory (4x the instance above), no GPU, $1.354 per hour (roughly the same cost)
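For concreteness, a minimal sketch of the "roughly equal cost per hour" check using the prices above (the 5% tolerance is an arbitrary assumption, not a proposed rule):

```python
# Hourly prices taken from the two configurations above.
GPU_INSTANCE = {"name": "8 vCPU, 48 GB, 1x P100", "usd_per_hour": 1.359}
CPU_INSTANCE = {"name": "32 vCPU, 192 GB, no GPU", "usd_per_hour": 1.354}

TOLERANCE = 0.05  # assumed tolerance for calling costs "roughly equal"

def roughly_equal_cost(a, b, tol=TOLERANCE):
    prices = (a["usd_per_hour"], b["usd_per_hour"])
    return abs(prices[0] - prices[1]) / max(prices) <= tol

print(roughly_equal_cost(GPU_INSTANCE, CPU_INSTANCE))  # True (~0.4% difference)
```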
@stanleybak Thank you for organizing the competition and drafting the competition rules!
I have a few suggestions/requests for clarification:
1.) Runtime caps: In the current draft it seems like the idea is to set the per-benchmark cap = per-instance cap * # instances. This would make it somewhat redundant (with the exception of being a guide for benchmark proposers). Is this the intention, or is the idea to lower the per-benchmark cap to favour fast (mean) runtimes?
2.) Property selection: One way to encourage generic certifiers for the image classification domain would be to use just one block of images from the corresponding test sets (e.g. the first or last 100 images) and prove the robustness of all (correct) classifications; perhaps we could define all corresponding benchmarks this way. (I see now that using random samples is suggested in the benchmark discussion; perhaps this should be noted in the rules.)
3.) Scoring: The distinction between "hard" and "easy" adversarial examples seems difficult to make and may depend to a large extent on random seeds. Therefore, in line with the focus on verification, I would suggest assigning the same, low value to all found adversarial examples (e.g. 1 point).
4.) Time Bonus: Currently, it is not specified how points will be assigned when two tools are considered to have the same runtime. I would suggest the following: If there are two or more fastest methods: all get two points, no other method receives points for the second-fastest time. If there are two or more second-fastest methods: they all get one point.
5.) Timeline: The current timeline does not seem to clearly align with the proposed phases of this year's competition. Perhaps we can begin the Measurement Phase mid June?
Also, more generally, I would suggest using new, different properties for this year's competition, both to avoid overfitting of methods and to provide the community with a new set of (perhaps more challenging) benchmarks for tracking progress in the coming year, since performance differences become harder to judge when most methods can solve the vast majority of benchmark tasks.
Thanks again for all the work put into this competition and I look forward to hearing the opinion of the other participants on these points
Best, Mark
@huanzhang12 The two-types-of-cloud-instances approach sounds like a good change, with roughly the same cost per hour. We may tweak some of the specific parameters, but Google Cloud does look more customizable from this perspective, so we will look into using that instead of AWS.
@mnmueller Thanks for the feedback.
In terms of runtime caps, the benchmark proposers can choose the number of instances, so they can select either lots of instances that can be solved quickly, or a small number of instances that require more computation time. The only intention is to have roughly the same total runtime per benchmark, so that we can have a reasonable cap on the total runtime over all benchmarks. Last year there were benchmarks with 6-hour timeouts, where if every instance reached its timeout, we would need about a week to check all instances. We want to avoid that situation.
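For concreteness, here is a minimal back-of-the-envelope sketch of this trade-off (all instance counts, timeouts, and the cap value below are illustrative assumptions, not proposed values):

```python
# Rough worst-case wall-clock estimate, assuming every instance hits its timeout.
benchmarks = [
    # (number of instances, per-instance timeout in seconds)
    (60, 300),   # many quick instances
    (5, 3600),   # a few long-running instances
]

total_seconds = sum(n * timeout for n, timeout in benchmarks)
print(f"worst case per tool: {total_seconds / 3600:.1f} hours")  # 10.0 hours

# With a roughly equal total-runtime budget per benchmark (e.g. 6 hours),
# proposers trade off instance count against per-instance timeout:
PER_BENCHMARK_CAP = 6 * 3600
for n, timeout in benchmarks:
    assert n * timeout <= PER_BENCHMARK_CAP
```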
Yes, we plan to use random image selection wherever possible to prevent over-fitting to specific images or cherry-picking images where methods work well. The plan was to fix the number of images and then select them based on a random seed that we decide at competition time, rather than taking the first 100 or last 100. We can update the rules to make this clearer. Do you think using the first 100 or last 100 is better than random?
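A minimal sketch of what seeded selection could look like (the seed, image count, and test-set size below are placeholders, not actual competition values):

```python
import random

SEED = 2021            # placeholder; the actual seed would be chosen by the organizers
NUM_IMAGES = 100       # placeholder number of images per benchmark
TEST_SET_SIZE = 10000  # placeholder test-set size (e.g., an MNIST/CIFAR-10 test set)

rng = random.Random(SEED)
selected_indices = sorted(rng.sample(range(TEST_SET_SIZE), NUM_IMAGES))
print(selected_indices[:10])
```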
Yes, this is right. I think the plan was to run some basic adversarial example toolbox (like foolbox with a PGD attack), and if that succeeds we consider it an easy instance. Maybe you're right though; it may be simpler to just give 1 point for all adversarial images and skip the distinction between easy and hard instances. Do other people have an opinion on this?
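For reference, a rough sketch of such an easy-instance filter using foolbox's L-infinity PGD attack (this assumes foolbox 3.x, a PyTorch classifier with inputs in [0, 1], and a placeholder epsilon; it is only an illustration, not the proposed tooling):

```python
import foolbox as fb
import torch

def is_easy_instance(model, image, label, epsilon=0.03):
    """Return True if a plain PGD attack already finds a violation, i.e. the
    instance would count as 'easy' under the old scoring. Sketch only."""
    fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
    attack = fb.attacks.LinfPGD()
    # foolbox returns (raw adversarials, clipped adversarials, success mask)
    _, _, success = attack(fmodel, image.unsqueeze(0),
                           torch.tensor([label]), epsilons=epsilon)
    return bool(success.item())
```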
The time-bonus suggestion seems reasonable.
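For concreteness, a minimal sketch of the proposed tie handling (assuming runtimes are already rounded so that ties are well defined, and only tools with a correct result are passed in):

```python
def time_bonus(runtimes):
    """runtimes: dict mapping tool name -> runtime for one instance.
    Fastest time gets 2 points; second-fastest gets 1 point; ties share the
    higher award and, if the fastest time is tied, no second-fastest points
    are given. Sketch only."""
    bonus = {tool: 0 for tool in runtimes}
    distinct_times = sorted(set(runtimes.values()))
    if not distinct_times:
        return bonus
    fastest = [t for t, r in runtimes.items() if r == distinct_times[0]]
    for tool in fastest:
        bonus[tool] = 2
    if len(fastest) == 1 and len(distinct_times) > 1:
        for tool, r in runtimes.items():
            if r == distinct_times[1]:
                bonus[tool] = 1
    return bonus

print(time_bonus({"A": 1.0, "B": 1.0, "C": 2.5}))  # {'A': 2, 'B': 2, 'C': 0}
```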
Yes, we plan to do measurements in June. I think there will be a few sub-steps for this, such as authors logging into cloud instances to download licenses and confirming their scripts are working.
In terms of "new, different properties for this year's competition", this is an interesting point. I think what they do in other competitions like SAT is they throw away the easy instances each year that everyone solved but keep around the ones that only a few tools or no tools solved. Then again, they have many many benchmarks to choose from, so have the luxury of excluding some. I'm not sure if we're there yet, in terms of having a variety of interesting benchmarks. Right now as written, it's up to the benchmark proposers to choose if they prefer to analyze a benchmark from last year or suggest a new one. Any other opinions on this?
I have updated the rules based on the feedback. Please check the updated document with the proposed changes (here) to see if there are any further changes suggested or if I missed anything important.
"Correct violated (where random tests or simple adversarial example generation did not succeed)" seems to have been removed, but "Correct violated (where random tests or simple adversarial example generation succeeded)" is now in green. Is this correct? What is the point value for "Correct violated (where random tests or simple adversarial example generation did not succeed)"? How are violations scored on properties other than local robustness (for example, a property stating that for all MNIST images, if the maximum class 3, then class 8 should have a higher value than class 7)? These are complex properties for which violations can be as informative as a holds result, and are not typical adversarial example generation problems.
@dlshriver Yes that was a mistake. I got rid of the "(where random tests or simple adversarial example generation succeeded)" in green.
Right now, the proposed change is that all properties with violations would be worth 1 point. Would you be in favor of going back to the previous rules, where we run adversarial example generation first, and then, if that fails, the violation would be worth a higher score?
I guess the motivation for the change was that violation cases should be easy to find using standard adversarial example generation approaches, and so are less interesting (this is true even for targeted-class properties like the one you mentioned).
I personally liked the original idea of recognising the importance of generating hard counterexamples, which could not be found in other ways, just as much as verifying the result. I did wonder, however, how we were going to establish that. Would we generate these first centrally via some variant of adversarial search and filter out any instances solved that way? If so, I guess each tool might add additional checks to exclude these so as not to miss out on the 10pts?
Also, it seems to me that, not dissimilarly from other areas (say, BDD-based verification and SAT-based BMC), right now some methods are more effective for verification and others for falsification, and we know from code verification that both aspects are important. It would follow that, under the previous rules, the composition of the actual test set (as a % of true vs false instances) would have a strong influence on the final result.
Perhaps, rather than trying to establish the ideal ratio of true/false instances and how many points each win warrants, we might accept that these are pretty different challenges and collect results for verifiers and falsifiers separately, or at least report them separately as well: as verifiers, as falsifiers, and with some combined measure as above.
It genuinely is a tricky issue, and I guess it is quite likely to influence the final score quite a bit.
@stanleybak Thank you for taking up the suggestions I made.
Regarding the choice of samples as a block vs fully random: I suggested using blocks of samples (perhaps randomly choosing n for the block [n*100, (n+1)*100-1]) instead of a collection of random indices, as when reading/reviewing a paper it is generally difficult to assess whether such random indices were generated truly randomly or via a selection process tailored to a specific method. And while this is not an issue here, I would suggest setting an example of avoiding these sets of "random" indices.
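A small sketch contrasting the two selection schemes (the seed, block size, and test-set size are placeholders):

```python
import random

SEED = 2021            # placeholder; announced at competition time
TEST_SET_SIZE = 10000  # placeholder test-set size
BLOCK_SIZE = 100

rng = random.Random(SEED)

# Fully random indices (current proposal):
random_indices = sorted(rng.sample(range(TEST_SET_SIZE), BLOCK_SIZE))

# Randomly chosen contiguous block [n*100, (n+1)*100 - 1] (block proposal):
n = rng.randrange(TEST_SET_SIZE // BLOCK_SIZE)
block_indices = list(range(n * BLOCK_SIZE, (n + 1) * BLOCK_SIZE))
```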
Regarding the different evaluation platforms: Can only one platform be chosen per participant or can we use a CPU instance for one task and a GPU instance for a different benchmark?
Regarding the valuation of falsifying a property: I agree with @stanleybak that even in the case of complex properties, finding a counterexample seems qualitatively different and perhaps easier than certifying the absence of one and it should therefore be scored differently (on top of the issue of classifying counterexamples as easy vs hard).
I like @alessiolomuscio's proposal of evaluating falsification separately, and would suggest that for such an evaluation the use of any library specifically designed for counterexample generation (e.g. foolbox) should be disallowed.
I think it might be worth having separate scores for "violated" and "holds" problems in addition to a combined score where the results are weighted equally, similar to SMT-COMP which has a total score, as well as scores on only the sat results and only the unsat results.
Another way of filtering out "easy" violations is to simply exclude from scoring any problems for which all verifiers returned a result in under X seconds. This filtering could be done in addition to the already proposed sampling or adversarial example generation approach.
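A sketch of this filtering step (the X-second threshold and the result format below are assumptions for illustration):

```python
EASY_THRESHOLD_SECONDS = 10.0  # "X seconds"; placeholder value

def instances_to_score(results):
    """results: dict mapping instance id -> {tool name: runtime in seconds,
    or None if the tool returned no result}. Drops instances that every
    tool solved in under the threshold. Sketch only."""
    kept = []
    for instance, per_tool in results.items():
        all_fast = all(r is not None and r < EASY_THRESHOLD_SECONDS
                       for r in per_tool.values())
        if not all_fast:
            kept.append(instance)
    return kept
```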
I disagree with @mnmueller that finding a counterexample is easier than certifying the absence of one. Certainly for some problems it can be easy to find a counterexample, just as for some problems it can be easy to certify that none exist. For example, with robustness problems, as the radius of the epsilon ball increases, finding violations becomes easier, while as it decreases, certifying the problem becomes easier. Just as with traditional SAT problems, there is a phase transition where problems become difficult, with easy problems on either side.
I also disagree with the suggestion that certain libraries should be disallowed. I think the issue of having problems that are too easily violated is better mitigated by simply running adversarial example generation or random sampling methods to prune those problems from the evaluation. If a tool is able to use these libraries in a way that significantly improves its violation-finding ability, I'm not convinced that shouldn't be allowed.
I think one of the goals of VNN-COMP is to understand better what methods are most effective for which problems. Part of this is determining what techniques work best on the current benchmarks. I don't think we should be excluding methods that can be used to solve these problems. We should want to better understand the space of all tools currently available and determine how we can leverage their techniques to solve these important problems better and faster.
Based on the discussion, I reverted the part about scoring for violations. If we find them using random testing / simple adversarial example generation, they will be worth fewer points. I think this is better than banning specific libraries (or algorithms), as there may be neat research directions in combining adversarial example generation methods with verification algorithms. I think sometimes, but not always, finding violations can be easy, and the scoring hopefully reflects that.
I added a mention of awards for best verifier and best falsifier.
For random image selection, I added that the organizers will select the random seeds, so as to ensure fairness. Hopefully this works, although we'll see as the benchmark proposal stage is starting now.