mathnet-numerics
mathnet-numerics copied to clipboard
Performance of SpecialFunctions.Erf and SpecialFunctions.Erfinv
I have used mathnet to provide the special functions Erf and ErfInv. They are however creating a bottleneck as I need the Error function to implement the Gaussian copula in the context of Monte Carlo simulation,
I have optimized the functions, by removing unnecessary local variables and redundant branches.
The error function and inverse error function relies heavily on Polynomial.Evaluate
which was a big bottleneck. For example, many branches use the statement.
Polynomial.Evaluate(z, arrayN) / Polynomial.Evaluate(z, arrayD)
I have replaced these with various functions performing this manually. as below.
private static double ErfImplA(double z)
{
const double n0 = 0.0033791670955125737;
const double n1 = -0.0007369565304816795;
const double n2 = -0.3747323373929196;
const double n3 = 0.08174424487335873;
const double n4 = -0.04210893199365486;
const double n5 = 0.007016570951209575;
const double n6 = -0.004950912559824351;
const double n7 = 0.0008716465990379225;
const double d0 = 1.0;
const double d1 = -0.21808821808792464;
const double d2 = 0.4125429727254421;
const double d3 = -0.08418911478731067;
const double d4 = 0.06553388564002416;
const double d5 = -0.012001960445494177;
const double d6 = 0.00408165558926174;
const double d7 = -0.0006159007215577697;
var n =
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(n7,
z, n6),
z, n5),
z, n4),
z, n3),
z, n2),
z, n1),
z, n0);
var d =
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(
FusedMultiplyAdd(d7,
z, d6),
z, d5),
z, d4),
z, d3),
z, d2),
z, d1),
z, d0);
return n / d;
}
This however comes at the cost of multiplying the polynomials out by hand but the results are significant.
| Method | Mean | Error | StdDev | Median |
|------------------------------ |------------:|-----------:|-----------:|------------:|
| ErfOriginalAllBranchAverage | 915.76 ns | 18.293 ns | 19.574 ns | 908.30 ns |
| ErfOptimisedAllBranchAverage | 457.32 ns | 6.222 ns | 5.196 ns | 457.92 ns |
| ErfOriginalHotBranchAverage1 | 237.21 ns | 2.324 ns | 2.060 ns | 236.85 ns |
| ErfOptimisedHotBranchAverage1 | 114.23 ns | 2.301 ns | 5.601 ns | 113.38 ns |
| ErfOriginalHotBranchAverage2 | 70.52 ns | 1.440 ns | 3.692 ns | 69.29 ns |
| ErfOptimisedHotBranchAverage2 | 26.51 ns | 0.552 ns | 0.542 ns | 26.38 ns |
| OriginalErfInv | 6,498.57 ns | 129.614 ns | 310.547 ns | 6,409.22 ns |
| OptimisedErfInv | 2,592.28 ns | 49.925 ns | 46.700 ns | 2,577.26 ns |
The outputs are not identical to the current implementation, but are very close (1.1102230246251565E-16 in small test of error function).
I have attached my implementation and benchmark code.
I am not sure if this would be considered a breaking change. If it cannot be included in Mathnet, I will add it to my library instead.
See example code here.
Hi,
Could this behavior be platform specific? I am running your code sample on my MacBook Pro M1 (in Release mode), and get the following results. It seems that the library implementation is faster in all cases, am I reading this correctly?
// * Summary *
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.3 (21E230) [Darwin 21.4.0]
Apple M1 2.40GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.103
[Host] : .NET 5.0.15 (5.0.1522.11506), X64 RyuJIT [AttachedDebugger]
DefaultJob : .NET 5.0.15 (5.0.1522.11506), X64 RyuJIT
| Method | Mean | Error | StdDev |
|------------------------------ |-------------:|-----------:|-----------:|
| ErfOriginalAllBranchAverage | 738.83 ns | 2.640 ns | 2.470 ns |
| ErfOptimisedAllBranchAverage | 2,843.27 ns | 13.305 ns | 12.445 ns |
| ErfOriginalHotBranchAverage1 | 224.54 ns | 0.684 ns | 0.640 ns |
| ErfOptimisedHotBranchAverage1 | 1,005.54 ns | 4.267 ns | 3.991 ns |
| ErfOriginalHotBranchAverage2 | 55.91 ns | 0.080 ns | 0.075 ns |
| ErfOptimisedHotBranchAverage2 | 320.37 ns | 5.005 ns | 4.179 ns |
| OriginalErfInv | 4,846.12 ns | 54.304 ns | 48.139 ns |
| OptimisedErfInv | 37,130.96 ns | 365.730 ns | 342.104 ns |
// * Warnings *
Environment
Summary -> Benchmark was executed with attached debugger
// * Hints *
Outliers
Benchmarks.ErfOriginalHotBranchAverage2: Default -> 2 outliers were detected (57.56 ns, 57.61 ns)
Benchmarks.ErfOptimisedHotBranchAverage2: Default -> 2 outliers were removed (345.91 ns, 357.92 ns)
Benchmarks.OriginalErfInv: Default -> 1 outlier was removed (5.16 us)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
1 ns : 1 Nanosecond (0.000000001 sec)
// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:02:35 (155.5 sec), executed benchmarks: 8
Global total time: 00:02:46 (166.84 sec), executed benchmarks: 8
// * Artifacts cleanup *