Improve linear regression method
Criterion currently uses least-squares methods to fit lines for its linear regression. However, the "white noise" assumption cannot be made, as the expected value of the bias (distribution) is not zero.
In most benchmark cases, our function will use the CPU at full speed, and bias in that situation will lead to a slower speed but not a higher one. So the bias follows a skewed distribution, which means the linear regression method could be improved.
But I haven't found any information or papers about this (and I think there should be some). I will do more research in this field over the next few days.
Maybe we could use quantile regression.
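To make the suggestion concrete: quantile regression replaces the squared-error objective with the pinball loss. As a hedged sketch (not Criterion's actual code, and `median_slope_through_origin` is a hypothetical helper name): for a slope-only model like Criterion's, the median (q = 0.5) fit through the origin reduces to a weighted median of the per-iteration ratios, which is cheap to compute:

```python
import numpy as np

def pinball_loss(y, pred, q):
    """The asymmetric loss that quantile regression minimizes."""
    r = y - pred
    return np.sum(np.maximum(q * r, (q - 1) * r))

def median_slope_through_origin(iters, times):
    """Median (q = 0.5) regression of time on iteration count, no intercept.

    Minimizing sum |d_i - t*i| = sum i * |d_i/i - t| reduces to the
    weighted median of the ratios d_i / i with weights i.
    """
    ratios = times / iters
    order = np.argsort(ratios)
    weights = iters[order].astype(float)
    cum = np.cumsum(weights)
    k = np.searchsorted(cum, cum[-1] / 2.0)   # first index past half the weight
    return ratios[order][k]

rng = np.random.default_rng(0)
iters = np.arange(1.0, 101.0)
# Right-skewed noise: interruptions only ever add time, never subtract it.
times = 3.0 * iters + rng.exponential(5.0, size=iters.size)
t_med = median_slope_through_origin(iters, times)
print(t_med)   # close to the true per-iteration time of 3.0
```

The weighted-median reduction is exact for the median through the origin; general quantiles or an intercept term need a linear-programming formulation.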
You are absolutely correct. I would not expect the time of repeated runs of a benchmark to follow a true normal distribution. For one thing, no matter how many times I time it, I suspect I will never get a negative runtime! For another, as you point out, I would expect a fairly heavy right tail from every time the benchmark gets interrupted by the OS and other such outlier-inducing effects.
There is an additional assumption for linear regression that I think does not hold for our use case. It is called "homoscedasticity": it means that the uncertainty in the measurement is not related to N. In our case, I suspect that the timing we get from running the benchmark once is much noisier than the timing we get from running the benchmark 100 times.
b ~ N(t, e), where b is the time we see a benchmark taking, ~ N denotes being sampled from a Normal distribution, t is the "true" time we want to report to the user, and e is the underlying random noise in how long the benchmark takes to run.
So when criterion runs its N=1 case we expect it to take:
b1 = b, where b1 is the time we see for the N=1 case. So b1 ~ N(t, e)
For the N=2 case we expect it to take:
b2 = b + b. So b2 ~ (N(t, e) + N(t, e)), but the variances of independent normal distributions add, so the standard deviation grows by a factor of sqrt(2) rather than staying constant. Thus b2 is not ~ N(2t, e) like "homoscedasticity" requires; it is b2 ~ N(2t, sqrt(2) * e)
In general we have:
bn ~ N(n*t, sqrt(n) * e)
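The √n scaling is easy to see in a simulation (a toy sketch with made-up numbers t = 2 and e = 0.3, not Criterion code):

```python
import numpy as np

rng = np.random.default_rng(1)
t, e = 2.0, 0.3       # assumed "true" per-iteration time and noise scale
trials = 50_000

def batch_time(n):
    """Simulate b_n: the sum of n independent N(t, e) per-iteration timings."""
    return rng.normal(t, e, size=(trials, n)).sum(axis=1)

b1, b100 = batch_time(1), batch_time(100)
# The spread grows like sqrt(n), not like n and not constant:
print(b1.std(), b100.std())   # roughly e and sqrt(100) * e = 10 * e
```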
Your concern is moderated by the code that throws out the extreme outliers. My concern is moderated by using the bootstrap instead of the estimated variance. Therefore I suspect that correcting for these limitations will not have a significant impact on criterion's results, but the only way to be sure is to model it correctly and see if it is different! I would start by seeing if there is a way to express what we need using a "Generalized linear model". If we can't find the right way to express our needs in that framework, then this should all be pretty straightforward using Bayesian methods.
@Eh2406 As far as I know, criterion doesn't throw out any outliers; it only warns you about them. I have a forked branch of criterion which deletes outliers. (It only works with --noplot for now.)
I have also developed another version of criterion to test quantile regression. It performs surprisingly well.

Thank you, I did not realize it included the outliers!
That graph looks promising! How hard would it be to show both regression lines? I would find that a useful way to see if the regressions are practically different.
FWIW This paper has some ideas about how to deal with non-normally distributed data: https://htor.inf.ethz.ch/publications/img/hoefler-scientific-benchmarking.pdf
Quantile regression would be great to have included with criterion.
Hey, thanks for the suggestion. I've been super busy IRL lately so it might be a while before I can get to this. I'll definitely have to dig into quantile regression, I haven't heard of it before.
@gz Thanks for sharing the paper! It inspired me a lot!
@bheisler I have a modified version of criterion.rs which uses quantile regression (I have also developed a crate, langbardo, to solve quantile regression by solving linear constraints). If you can design a customizable Analysis/Report process, I can integrate it with criterion right away.
The biggest problem is that quantile regression is much slower to compute than least squares (at least with the simplest algorithm), so very large resample counts are not acceptable.
If the bias doesn't follow a normal distribution, the result is not reliable.
I think that's a bit overstating it: the OLS regression is an unbiased, consistent estimator of the mean with asymptotically good variance, regardless of the distributions involved. I do agree, however, that it could be improved.
In our case I suspect that the timing we get from running the benchmark once is much noisier than the timing we get running the benchmark 100 times.
The variance of OLS regression of triangularly sampled benchmarks (time running once, then time running twice, then time running thrice, etc. to total a triangular number of runs) has 3 times the variance of doing the same number of runs in one batch and dividing that time by the number of runs. (This is in the limit; many practical sizes end up closer to a factor of 2 overhead.) That's certainly not good, and we can do better, but at least the noise overhead doesn't grow unboundedly with the number of samples.
Criterion's model is that there are two unknown probability distributions, M (measurement overhead) and X (the metric of the actual function we wish to test). These distributions are not assumed to be normal, and while they should have nonnegative support, that isn't necessary. A data point that Criterion takes with i iterations (which I'll call dᵢ) is assumed to be the sum of one sample from M and i independent samples from X. Any linear predictor for slope that acts correctly on truly collinear data (including but not limited to OLS regression) will be an unbiased estimator for the mean of X.
Proof
A linear predictor is specified as a set of coefficients (aᵢ) for a linear combination of the data points, so
estimate = ∑ᵢ (aᵢ × dᵢ)
We assume that the predictor acts correctly on collinear data, that is, for all m and b,
∑ᵢ (aᵢ × (m × i + b)) = m
However,
E[dᵢ] = E[M] + E[X] + E[X] + ...
= E[M] + i × E[X]
so
E[estimate] = E[∑ᵢ (aᵢ × dᵢ)]
= ∑ᵢ (aᵢ × E[dᵢ])
= ∑ᵢ (aᵢ × (E[X] × i + E[M]))
= E[X]
where the last step applies the collinear-data assumption with m = E[X] and b = E[M].
Since all samples are taken independently, the variance of a linear predictor for a fixed number of data points can be broken down, with no covariance terms, into a linear combination of the variance of M and the variance of X. Since measurement overhead is typically simple and consistent, it makes more sense to minimize the coefficient of the variance of X. Directly solving for the optimal solution, the appropriate coefficients are
aᵢ = λ₁ - λ₂ / i
where
λ₁ = H / (H × ∑ᵢ (i) - n²)
λ₂ = n / (H × ∑ᵢ (i) - n²)
H = ∑ᵢ (1 / i)
n = number of data points
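These coefficients and their constraints can be checked in exact rational arithmetic (a sketch; `optimal_coeffs` is a hypothetical helper name, not part of any crate):

```python
from fractions import Fraction

def optimal_coeffs(n):
    """a_i = λ1 - λ2/i for i = 1..n, with λ1, λ2, H as defined above."""
    i = [Fraction(k) for k in range(1, n + 1)]
    H = sum(1 / k for k in i)
    denom = H * sum(i) - Fraction(n) ** 2
    lam1 = H / denom
    lam2 = Fraction(n) / denom
    return [lam1 - lam2 / k for k in i]

a = optimal_coeffs(10)
# The two "acts correctly on collinear data" constraints hold exactly:
assert sum(a) == 0
assert sum(ak * k for ak, k in zip(a, range(1, 11))) == 1
# And the Var[X] coefficient beats OLS's 6/(n(n-1)):
assert sum(ak * ak * k for ak, k in zip(a, range(1, 11))) < Fraction(6, 10 * 9)
```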
Derivation
The variance of an arbitrary linear predictor goes as
Var[estimate] = Var[∑ᵢ (aᵢ × dᵢ)]
= ∑ᵢ (aᵢ² × Var[dᵢ])
= ∑ᵢ (aᵢ² × (Var[M] + i × Var[X]))
= (∑ᵢ (aᵢ²)) × Var[M] + (∑ᵢ (i × aᵢ²)) × Var[X]
We also have the constraint that for all m and b,
∑ᵢ (aᵢ × (m × i + b)) = m
or split into coefficients of m and b,
∑ᵢ (aᵢ × i) = 1
∑ᵢ (aᵢ) = 0
To minimize variance, we find a stationary point of ∑ᵢ (i × aᵢ²) + λ₁ × 2 × (1 - ∑ᵢ (aᵢ × i)) + λ₂ × 2 × ∑ᵢ (aᵢ) with respect to the aᵢ, using the two Lagrange multipliers λ₁ and λ₂ to enforce the "works on collinear data" constraints. (The unnecessary factors of 2 make later equations slightly simpler.) Setting all derivatives to zero,
2 × i × aᵢ - λ₁ × 2 × i + λ₂ × 2 = 0
∑ᵢ (aᵢ × i) = 1
∑ᵢ (aᵢ) = 0
From the first (parameterized) equation, we can write each aᵢ in terms of the Lagrange multipliers:
aᵢ = (λ₁ × 2 × i - λ₂ × 2) / (2 × i)
= λ₁ - λ₂ / i
Plugging that into the last constraint,
0 = ∑ᵢ (aᵢ)
= ∑ᵢ (λ₁ - λ₂ / i)
= λ₁ × ∑ᵢ (1) - λ₂ × ∑ᵢ (1 / i)
λ₁ = λ₂ × ∑ᵢ (1 / i) / ∑ᵢ (1)
We will label the number of data points as n and the sum of the reciprocals of the iteration counts H, so λ₁ = λ₂ × H / n. Applying the middle constraint,
1 = ∑ᵢ (aᵢ × i)
= ∑ᵢ ((λ₁ - λ₂ / i) × i)
= ∑ᵢ (λ₁ × i - λ₂)
= λ₁ × ∑ᵢ (i) - λ₂ × n
= λ₂ × H × ∑ᵢ (i) / n - λ₂ × n
= λ₂ × (H × ∑ᵢ (i) / n - n)
= λ₂ × (H × ∑ᵢ (i) - n²) / n
λ₂ = n / (H × ∑ᵢ (i) - n²)
With the optimal solution in hand, we can fairly compare methods. In all of the following, N total iterations (not measurements) are taken.
- Batches-of-one: Measure a single iteration `N` times, take the mean.
  - Biased and inconsistent: `E[estimate] = E[X] + E[M]`
  - Variance from user's function: `N⁻¹ × Var[X]`
  - Variance due to measurement overhead: `N⁻¹ × Var[M]`
- OLS regression on triangular batches: Take `√(2N)` measurements of batches starting at size 1 and increasing by 1 between batches, then calculate the OLS slope.
  - Unbiased and consistent: `E[estimate] = E[X]`
  - Variance from user's function: `(3 + o(1)) × N⁻¹ × Var[X]`
  - Variance due to measurement overhead: `(4.2426... + o(1)) × N^(-3/2) × Var[M]`
- Minimum-variance linear regression on triangular batches: Take `√(2N)` measurements of batches starting at size 1 and increasing by 1 between batches, then calculate the slope with the coefficients as above.
  - Unbiased and consistent: `E[estimate] = E[X]`
  - Variance from user's function: `(1 + o(1)) × N⁻¹ × Var[X]`
  - Variance due to measurement overhead: `(13.159... + o(1)) × N⁻¹ × (ln N)⁻² × Var[M]`
- Mean-of-means on triangular batches: Take `√(2N)` measurements of batches starting at size 1 and increasing by 1 between batches, divide each measurement by the number of iterations in the batch, and take the mean of the results.
  - Biased and consistent: `E[estimate] = E[X] + (0.35355... + o(1)) × ln N / √N × E[M]`
  - Variance from user's function: `(0.25 + o(1)) × ln N × N⁻¹ × Var[X]`
  - Variance due to measurement overhead: `(0.822... - o(1)) × N⁻¹ × Var[M]`
- Single batch: Take a single measurement of a single batch with `N` iterations, then divide by `N`.
  - Biased and consistent: `E[estimate] = E[X] + N⁻¹ × E[M]`
  - Variance from user's function: `N⁻¹ × Var[X]`
  - Variance due to measurement overhead: `N⁻² × Var[M]`
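A small simulation comparing three of these estimators on the same triangular data reproduces the ordering above (a sketch with arbitrary exponential distributions for X and M; not Criterion's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40                                  # number of measurements, batch sizes 1..n
i = np.arange(1, n + 1, dtype=float)
reps = 20_000
EX, EM = 2.0, 0.5                       # means of exponential X and M

# d_i = one draw from M plus i draws from X, per repetition
d = rng.exponential(EM, (reps, n))
for k in range(n):
    d[:, k] += rng.exponential(EX, (reps, k + 1)).sum(axis=1)

coeffs = {}
coeffs["OLS"] = (i - i.mean()) / ((i - i.mean()) ** 2).sum()
H = (1.0 / i).sum()
denom = H * i.sum() - n ** 2
coeffs["min-var"] = H / denom - (n / denom) / i      # a_i = λ1 - λ2/i
coeffs["mean-of-means"] = 1.0 / (n * i)

results = {name: d @ a for name, a in coeffs.items()}
for name, est in results.items():
    print(f"{name:>14}: mean={est.mean():.3f}  var={est.var():.5f}")
```

Both regressions should land on E[X] = 2.0 with the minimum-variance coefficients spreading less, while mean-of-means comes out biased upward by the E[M] terms.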
Variance Calculations
- Batches-of-one:

  Var[(∑ᵢ (dᵢ)) / N] = ∑ᵢ (Var[dᵢ]) / N²
  = Var[dᵢ] / N
  = N⁻¹ × Var[X] + N⁻¹ × Var[M]

- OLS on triangular batches:

  Looking back at my stuff from when I calculated this earlier, I didn't take the theoretical analysis all the way: I just programmatically calculated

  Var[∑ᵢ (aᵢ × dᵢ)] = ∑ᵢ (aᵢ² × Var[dᵢ])
  = ∑ᵢ (aᵢ² × (i × Var[X] + Var[M]))
  = Var[X] × ∑ᵢ (aᵢ² × i) + Var[M] × ∑ᵢ (aᵢ²)

  plugging in the appropriate coefficients, and found that it exactly matched

  6 / (n × (n - 1)) × Var[X] + 12 / (n × (n - 1) × (n + 1)) × Var[M]
  = (6 + o(1)) × n⁻² × Var[X] + (12 + o(1)) × n⁻³ × Var[M]
  = (3 + o(1)) × N⁻¹ × Var[X] + (3 × √2 + o(1)) × N^(-3/2) × Var[M]

  I may edit in a proof at some point.
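The claimed closed forms can be verified exactly in the meantime (a small check in exact rational arithmetic, not a proof):

```python
from fractions import Fraction

def ols_slope_coeffs(n):
    """OLS slope coefficients a_i = (i - mean(i)) / sum((i - mean(i))^2)."""
    i = [Fraction(k) for k in range(1, n + 1)]
    ibar = sum(i) / n
    S = sum((k - ibar) ** 2 for k in i)
    return [(k - ibar) / S for k in i]

for n in (5, 20, 100):
    a = ols_slope_coeffs(n)
    ks = range(1, n + 1)
    # Var[X] coefficient: sum a_i^2 * i = 6 / (n(n-1))
    assert sum(ak * ak * k for ak, k in zip(a, ks)) == Fraction(6, n * (n - 1))
    # Var[M] coefficient: sum a_i^2 = 12 / (n(n-1)(n+1))
    assert sum(ak * ak for ak in a) == Fraction(12, n * (n - 1) * (n + 1))
```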
- Minimum-variance regression on triangular batches:

  Substituting n = √(2N) and the logarithmic approximation to H, the nth harmonic number, we derive the asymptotic properties of the constants involved:

  H = ln (√(2N)) + γ + o(1)
  = (ln N) / 2 + γ + (ln 2) / 2 + o(1)
  = (ln N) / 2 + 0.923789... + o(1)
  = (1/2 + o(1)) × ln N

  λ₁ = H / (H × ∑ᵢ (i) - n²)
  = H / (H × n × (n + 1) / 2 - n²)
  = H / (H × n² × (1/2 + 1/(2 × n) - 1/H))
  = H / (H × N × (1 + 1/n - 2/H))
  = H / (H × N) × 1/(1 + 1/n - 2/H)
  = H / (H × N) × (1 + o(1))
  = (1 + o(1)) × N⁻¹

  For simplicity, we express everything using only λ₁:

  aᵢ = λ₁ - λ₂ / i
  = λ₁ - (λ₁ × n / H) / i
  = λ₁ × (1 - n / (H × i))

  These pieces combine to calculate the variance:

  Var[∑ᵢ (aᵢ × dᵢ)] = ∑ᵢ (aᵢ² × Var[dᵢ])
  = ∑ᵢ (aᵢ² × (i × Var[X] + Var[M]))
  = Var[X] × ∑ᵢ (aᵢ² × i) + Var[M] × ∑ᵢ (aᵢ²)
  = Var[X] × ∑ᵢ (λ₁² × (1 - n / (H × i))² × i) + Var[M] × ∑ᵢ (λ₁² × (1 - n / (H × i))²)
  = λ₁² × Var[X] × ∑ᵢ ((1 - n / (H × i))² × i) + λ₁² × Var[M] × ∑ᵢ ((1 - n / (H × i))²)
  = λ₁² × Var[X] × ∑ᵢ ((1 - 2 × n / (H × i) + n² / (H² × i²)) × i) + λ₁² × Var[M] × ∑ᵢ (1 - 2 × n / (H × i) + n² / (H² × i²))
  = λ₁² × Var[X] × ∑ᵢ (i - 2 × n / H + n² / (H² × i)) + λ₁² × Var[M] × ∑ᵢ (1 - 2 × n / (H × i) + n² / (H² × i²))
  = λ₁² × Var[X] × (∑ᵢ (i) - 2 × n / H × ∑ᵢ (1) + n² / H² × ∑ᵢ (1 / i)) + λ₁² × Var[M] × (∑ᵢ (1) - 2 × n / H × ∑ᵢ (1 / i) + n² / H² × ∑ᵢ (1 / i²))
  = λ₁² × Var[X] × (∑ᵢ (i) - 2 × n² / H + n² / H) + λ₁² × Var[M] × (n - 2 × n + n² / H² × ∑ᵢ (1 / i²))
  = λ₁² × Var[X] × (∑ᵢ (i) - n² / H) + λ₁² × Var[M] × (n² / H² × ∑ᵢ (1 / i²) - n)
  = λ₁² × Var[X] × (H × ∑ᵢ (i) - n²) / H + λ₁² × Var[M] × (n² / H² × ∑ᵢ (1 / i²) - n)
  = λ₁ × Var[X] + λ₁² × Var[M] × (n² / H² × ∑ᵢ (1 / i²) - n)
  = (1 + o(1)) × N⁻¹ × Var[X] + (1 + o(1)) × N⁻² × (n² / H² × ∑ᵢ (1 / i²) - n) × Var[M]
  = (1 + o(1)) × N⁻¹ × Var[X] + (1 + o(1)) × N⁻² × n² / H² × ∑ᵢ (1 / i²) × Var[M]
  = (1 + o(1)) × N⁻¹ × Var[X] + (π²/6 + o(1)) × N⁻² × n² / H² × Var[M]
  = (1 + o(1)) × N⁻¹ × Var[X] + (π²/3 + o(1)) × N⁻¹ / H² × Var[M]
  = (1 + o(1)) × N⁻¹ × Var[X] + (4 × π²/3 + o(1)) × N⁻¹ × (ln N)⁻² × Var[M]
- Mean-of-means on triangular batches:

  This is a linear combination like the regressions, with coefficients aᵢ = 1 / (n × i). In expectation,

  E[∑ᵢ (aᵢ × dᵢ)] = ∑ᵢ (aᵢ × E[dᵢ])
  = ∑ᵢ ((i × E[X] + E[M]) / (n × i))
  = ∑ᵢ (E[X] / n + E[M] / (n × i))
  = E[X] / n × ∑ᵢ (1) + E[M] / n × ∑ᵢ (1 / i)
  = E[X] + E[M] / n × ∑ᵢ (1 / i)

  As in the minimum-variance regression, harmonic numbers converge to ln n + O(1) = (1 + o(1)) × ln n = (1/2 + o(1)) × ln N, so

  E[∑ᵢ (aᵢ × dᵢ)] = E[X] + E[M] / n × (1 + o(1)) × ln n
  = E[X] + (1/2 + o(1)) × ln N / n × E[M]
  = E[X] + (1/2 + o(1)) × ln N / √(2 × N) × E[M]
  = E[X] + (1/(2 × √2) + o(1)) × ln (N) / √N × E[M]

  Calculating the variance,

  Var[∑ᵢ (aᵢ × dᵢ)] = ∑ᵢ (aᵢ² × Var[dᵢ])
  = ∑ᵢ (aᵢ² × (i × Var[X] + Var[M]))
  = Var[X] × ∑ᵢ (aᵢ² × i) + Var[M] × ∑ᵢ (aᵢ²)
  = 1 / n² × Var[X] × ∑ᵢ (1 / i) + 1 / n² × Var[M] × ∑ᵢ (1 / i²)
  = 1/2 × N⁻¹ × Var[X] × ∑ᵢ (1 / i) + 1/2 × N⁻¹ × Var[M] × ∑ᵢ (1 / i²)
  = (1/4 + o(1)) × ln N × N⁻¹ × Var[X] + (π²/12 - o(1)) × N⁻¹ × Var[M]
- Single batch:

  Var[(M + X + X + ... + X) / N] = N⁻² × (Var[M] + Var[X] + Var[X] + ... + Var[X])
  = N⁻² × (Var[M] + N × Var[X])
  = N⁻¹ × Var[X] + N⁻² × Var[M]
Note that none of this analysis made any assumptions about normality or any other property of the distributions. It also only considers linear predictors that simply take a linear combination of the data points; I wouldn't be surprised if a nonlinear predictor could exploit the nonnegative support assumption to get better results, though I wouldn't expect massive improvement.
To be clear, none of this invalidates the interest in quantile regression, which is qualitatively different. I only considered estimation of the mean.
TL;DR: OLS regression isn't bad at finding the mean, and it could be improved up to a factor of 1.73 (and no more) in standard deviation while keeping minimal assumptions. The "additional statistics" that come from treating data points as samples (dividing by iteration count) are biased and converge particularly slowly. Measuring medians and quantiles with quantile regression could be more practically useful in most cases, but means are useful for e.g. throughput.
Has someone verified the proof?
Criterion's model is that there are two unknown probability distributions, M (measurement overhead) and X (the metric of the actual function we wish to test). These distributions are not assumed to be normal, and while they should have nonnegative support, that isn't necessary. A data point that Criterion takes with i iterations (which I'll call dᵢ) is assumed to be the sum of one sample from M and i independent samples from X. Any linear predictor for slope that acts correctly on truly collinear data (including but not limited to OLS regression) will be an unbiased estimator for the mean of X.
Is there another resource that describes Criterion's model?
I did some searching of robust regression methods, and one in particular seems to fit Criterion's regression needs: M-estimation. It is much more vulnerable to outliers in the explanatory variable and thus has fallen out of favor as a general robust regression method, but such outliers are extremely unlikely under criterion's model of timing a predetermined number of iterations at a time.
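As a sketch of how M-estimation could look here (a generic Huber fit via iteratively reweighted least squares; not an existing Criterion API, and `delta` plus the toy data are made up):

```python
import numpy as np

def huber_slope(i, d, delta=1.0, n_iter=50):
    """Huber M-estimate of (slope, intercept) via iteratively reweighted least squares.

    Residuals within `delta` get full weight (plain least squares); larger
    residuals are down-weighted, taming the heavy right tail of timing noise.
    """
    X = np.column_stack([i, np.ones_like(i)])
    beta = np.linalg.lstsq(X, d, rcond=None)[0]      # start from the OLS fit
    for _ in range(n_iter):
        r = np.abs(d - X @ beta)
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * d))
    return beta[0]

rng = np.random.default_rng(4)
i = np.arange(1.0, 51.0)
d = 2.0 * i + 1.0 + rng.normal(0.0, 0.5, i.size)
d[::10] += 30.0                     # occasional "OS interruption" outliers
print(huber_slope(i, d))            # close to the true slope 2.0
```

Since the batch sizes (the explanatory variable) are chosen by Criterion itself, the leverage-point weakness mentioned above never comes into play.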
@gereeter Thanks for your proof. I realized that even with OLS, the unbiasedness of the slope can also be derived from the i.i.d. assumption: if all biases have the same distribution (even if their expected value is not zero), the slope estimate is still unbiased.
In your derivation of aᵢ, if we instead choose to minimize the coefficient in front of Var[M], we get the same result as OLS, aᵢ = (i - ī) / ∑ᵢ (i × (i - ī)), which is not surprising.