DsaTest.testTiming() could use a warmup

Open · ascarpino opened this issue on Jul 11, 2023 · 3 comments

I observed a lot of variability in the results from this DSA timing test. Sigma values on the same machine could vary by over 2 sigma between runs. Looking at the individual timing results over multiple runs, it was clear that there is a startup cost interfering with the check.

The test starts off with getCurrentThreadCpuTime() differences (in nanoseconds) of: 3496000 1320000 1220000 1171000 1054000 1065000 1009000 968000 987000 1094000 1173000 1171000

After the first 100 iterations, the differences have narrowed: 387000 381000 362000 338000 348000 334000 348000

After 260 iterations, the values finally reach their lowest level: 175000 177000 165000 167000 171000 164000 162000 163000 171000 177000 173000 164000 166000

Only now are we getting the true timing of the operation. Discarding just the first 300 iterations halved the sigma values returned by the test and made them more consistent run over run. My belief is that the variability in sigma values comes from larger k values being randomly chosen during the early, slow iterations, which skews the sigma. The timing differences themselves are likely caused by the transition from interpreted Java code to math intrinsics and/or the allocation/GC pressure from the 50,000 BigIntegers and 50,000 longs allocated just before the timing loop. As with many performance tests, a warmup sequence is needed to remove startup costs.
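For anyone who wants to reproduce the cliff outside the test suite, a minimal standalone probe along these lines works; the key size, message size, and batch parameters here are illustrative choices of mine, not values from DsaTest:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Signs repeatedly and prints the average per-signature thread CPU time for
// each batch of 100 iterations, making the warmup cliff visible.
public class WarmupProbe {
  public static void main(String[] args) throws Exception {
    KeyPairGenerator kpg = KeyPairGenerator.getInstance("DSA");
    kpg.initialize(2048);
    KeyPair pair = kpg.generateKeyPair();
    Signature signer = Signature.getInstance("SHA256withDSA");
    signer.initSign(pair.getPrivate());
    byte[] message = new byte[64];
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    for (int batch = 0; batch < 10; batch++) {
      long total = 0;
      for (int i = 0; i < 100; i++) {
        long start = bean.getCurrentThreadCpuTime();
        signer.update(message);
        signer.sign();
        total += bean.getCurrentThreadCpuTime() - start;
      }
      System.out.println("batch " + batch + ": avg " + (total / 100) + " ns/signature");
    }
  }
}
```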

The simplest fix for this is to throw away the first 300 results.

The relevant code starts at line 516 of wycheproof/java/com/google/security/wycheproof/testcases/DsaTest.java:

-    // The timings below are quite noisy. Thus we need a large number of samples.
+    // The timings below are quite noisy. To reduce that noise, a large sample
+    // is taken and a warmup eliminates startup costs.
+    int loops = 50300;
     int samples = 50000;
+    int j = 0;
     long[] timing = new long[samples];
     BigInteger[] k = new BigInteger[samples];
-    for (int i = 0; i < samples; i++) {
+    for (int i = 0; i < loops; i++) {
       long start = bean.getCurrentThreadCpuTime();
       signer.update(messageBytes);
       byte[] signature = signer.sign();
-      timing[i] = bean.getCurrentThreadCpuTime() - start;
-      k[i] = extractK(signature, h, priv, false);
+      long end = bean.getCurrentThreadCpuTime();
+      // Skip the first 300 iterations as warmup.
+      if (i >= 300) {
+        timing[j] = end - start;
+        k[j++] = extractK(signature, h, priv, false);
+      }
     }

ascarpino · Jul 11, 2023

I don't think any changes are necessary here.

What the test does is generate signatures, select a subset of those signatures based on timing information, and then check whether the k's used to generate the selected signatures are biased. If it is possible to select DSA signatures with small k's by picking the signatures that were generated faster than others, then the implementation has a weakness.

If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should follow a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.
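For readers who have not looked at the test, the selection idea can be sketched roughly as follows. This is a simplified reconstruction, not the actual code in DsaTest; in particular the fastest-half cutoff and the comparison of the mean k against q/2 are my assumptions about the method:

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.Comparator;

public final class KBiasSketch {
  // Selects the fastest half of the signatures and measures how far the mean
  // of their k values deviates from q/2, in units of the standard deviation
  // expected for uniformly distributed k. A result near 0 means no visible bias.
  static double sigmaOfFastestHalf(long[] timing, BigInteger[] k, BigInteger q) {
    int n = timing.length;
    Integer[] idx = new Integer[n];
    for (int i = 0; i < n; i++) {
      idx[i] = i;
    }
    Arrays.sort(idx, Comparator.comparingLong(i -> timing[i]));
    int m = n / 2;  // fastest half; the cutoff in the real test may differ
    BigInteger sum = BigInteger.ZERO;
    for (int i = 0; i < m; i++) {
      sum = sum.add(k[idx[i]]);
    }
    double mean = sum.doubleValue() / m;
    // For uniform k in [0, q): mean q/2, variance q^2/12, so the sample mean
    // of m values has standard deviation q / sqrt(12 * m).
    double stdDevOfMean = q.doubleValue() / Math.sqrt(12.0 * m);
    return (q.doubleValue() / 2.0 - mean) / stdDevOfMean;
  }
}
```

If the timing carries no information about k, the selected subset is still a uniform sample, and this statistic is approximately standard normal no matter how noisy the timings are; that is the property the paragraph above relies on.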

If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.

There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also make it possible to improve the test.

bleichenbacher-daniel · Dec 21, 2023

> If an implementation uses uniformly distributed k's for DSA and ECDSA and does not leak timing information about the nonce, then the test selects subsets of signatures with uniformly distributed k's, and hence the result should follow a normal distribution. Any disturbing factor such as garbage collection, warmup, load on the test server, overheating, etc. does not change this distribution if the implementation is correct. This is an important property of the test, since its goal is to be run regularly as a unit test. External influences must not be able to lead to false positives. Noise just makes it more difficult to detect a bias.

I would absolutely disagree with the premise that external factors like warmup, GC, server load, etc. do not change the distribution. Given the randomness of k, noise can be introduced at unfortunate times. That does not show a weakness in the implementation; it shows a weakness in the test. The test does try to mitigate some of this with a large allowance for sigma, but as the results above show, it could be hard for that allowance to overcome a 10x performance difference if certain sizes of k occur at the wrong time. For example, there may be 100 small k values in a test run, or 1,000, and many of those small k's may fall in the first half of the test or in the second.

The lack of a warmup also fails to take intrinsics into consideration. Once the C2 compiler decides the method is hot, the intrinsic changes the performance characteristics and disrupts the timing distribution. That is not a weakness in the implementation; it is a failure to test during normal system operation.

> If the test result deviates significantly from a normal distribution, then this means either just bad luck or an actual bias. I suspect that the larger variance of the test results reported above was just caused by a small sample size.

The results were generated with the 50,000 iterations that the wycheproof test uses. It's true that more iterations will reduce the influence of noise, but a warmup would remove the biggest source of noise without significantly lengthening the test run.

> There are a number of things that could potentially be done to improve the accuracy of the test. Obviously, generating more signatures gives better results. Better timing information would help, but unfortunately it is often difficult to influence the test environment. More detailed timing (e.g., time spent in particular functions) would also make it possible to improve the test.

Accept what I suggested or not; that is your decision.

ascarpino · Dec 22, 2023

The point I wanted to make is that there cannot be a test failure because of noise. If the implementation is correct, then the expected result will be close to a normal distribution with variance 1.

Too much noise can of course hide timing leaks. By selecting the signatures with the shortest timings, the test eliminates the biggest influences of noise without any need to examine the environment. Slow signatures generated during startup or during garbage collection are most likely not selected, and as long as their number is small they have no significant influence. Also, if the test becomes slower in the middle because of heavy load from other processes, then the result will be computed just from the 25,000 signatures generated during the quiet time. This can miss a bias, but it cannot lead to false positives.
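That claim is easy to sanity-check with the sketch posted above: in the toy simulation below (all parameters invented for illustration), timing is independent of k and a random 10% of the timings are inflated to mimic GC pauses or load spikes. The inflated samples simply drop out of the fastest half, so the statistic stays near zero:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Toy simulation: uniform k, timing independent of k, plus artificial slow
// outliers. Prints a sigma that stays within normal-distribution range.
public final class NoiseRobustnessDemo {
  public static void main(String[] args) {
    SecureRandom rand = new SecureRandom();
    BigInteger q = BigInteger.probablePrime(256, rand);  // stand-in for a DSA q
    int n = 50_000;
    long[] timing = new long[n];
    BigInteger[] k = new BigInteger[n];
    for (int i = 0; i < n; i++) {
      BigInteger ki;
      do {
        ki = new BigInteger(q.bitLength(), rand);  // rejection-sample uniform k < q
      } while (ki.compareTo(q) >= 0);
      k[i] = ki;
      timing[i] = 160_000 + rand.nextInt(20_000);    // steady-state baseline (ns)
      if (rand.nextInt(10) == 0) {
        timing[i] += 1_000_000;                      // simulated GC pause / load spike
      }
    }
    System.out.println("sigma = " + KBiasSketch.sigmaOfFastestHalf(timing, k, q));
  }
}
```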

The current setup of the test is for continuous testing. If a randomized test is repeated many times then it is important that the probability of false positives is small. Hence the large threshold. For other usages it might be reasonable to use a smaller threshold.
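To put rough numbers on the false-positive rate, the one-sided tail of a standard normal can be computed with, e.g., commons-math3 (a dependency assumption; the thresholds below are examples, not the value DsaTest actually uses):

```java
import org.apache.commons.math3.special.Erf;

// One-sided tail of a standard normal: P(Z > t) = erfc(t / sqrt(2)) / 2.
public final class ThresholdSketch {
  public static void main(String[] args) {
    for (double t : new double[] {3, 5, 7}) {
      double p = 0.5 * Erf.erfc(t / Math.sqrt(2.0));
      System.out.printf("threshold %.0f sigma -> ~%.1e false positives per run%n", t, p);
    }
  }
}
```

That works out to roughly 1.3e-3 at 3 sigma, 2.9e-7 at 5 sigma, and 1.3e-12 at 7 sigma, which is why a generous threshold makes continuous runs essentially immune to failures by chance.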

bleichenbacher-daniel · Dec 22, 2023