mathnet-numerics
MKL provider : OSX vs Windows performance
Hello,
I've started a simple benchmark. I need to do a lot of vector computations for a spring-mass system. For a typical system, my algorithm runs about 1e5 iterations, moving about 1e4 points (x, y, z) at each iteration.
I'm investigating how the MKL provider could speed up my computations. I'm also comparing two different layouts for my data structure:
- Array of Structure (AoS)
- Structure of Arrays (SoA)
based on A Guide to Vectorization with Intel® C++ Compilers :
The most common and likely well known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index. This data can be organized as an Array Of Structures (AOS) or a Structure Of Arrays (SOA). While AOS organization is excellent for encapsulation it can be poor for use of vector processing. Selecting appropriate data structures can also make vectorization of the resulting code more effective.
Using the code sample given below, I've noticed that performance with the MKL provider is very different between Mono on Mac and Windows.
The benchmark does pointwise multiplication of double vectors of size N = 10000.
- For the AoS layout, the data is organized in a jagged array double[1000][10].
- For the SoA layout, the data is organized in a single array or Math.NET vector double[10000].
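To make the two layouts concrete, here is a minimal sketch of how the jagged array maps onto the single flat array. The class and method names (LayoutSketch, MakeJagged, Flatten) are made up for illustration; the index mapping n * i + j is the one used in the benchmark code.

```csharp
using System;

static class LayoutSketch
{
    // Jagged layout: ne separate heap arrays of n doubles each.
    public static double[][] MakeJagged(int ne, int n)
    {
        var a = new double[ne][];
        for (int i = 0; i < ne; i++)
        {
            a[i] = new double[n];
            for (int j = 0; j < n; j++)
                a[i][j] = i * n + j; // deterministic fill, for illustration only
        }
        return a;
    }

    // Flat layout: one contiguous array; element (i, j) lives at index n * i + j.
    public static double[] Flatten(double[][] jagged, int n)
    {
        var flat = new double[jagged.Length * n];
        for (int i = 0; i < jagged.Length; i++)
            Array.Copy(jagged[i], 0, flat, n * i, n);
        return flat;
    }
}
```

With ne = 1000 and n = 10, the jagged version holds 1000 small arrays, while the flat version is a single block of 10000 doubles.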
Here are the results. Timing is given in ns per elementary operation (= total elapsed time in s x 1e9 / (loop x N), where loop = 10000 is the number of repetitions).
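As a sketch of how this metric can be computed, the helper below divides the elapsed time by loop * N, as the benchmark code does. The names TimingSketch, NsPerElop and Measure are hypothetical. Unlike the benchmark, it reads Stopwatch.Elapsed.TotalMilliseconds (fractional) instead of ElapsedMilliseconds (whole milliseconds) and makes one untimed warm-up call, which is an assumption about what a fairer measurement might look like rather than what was actually run.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Converts a total elapsed time into nanoseconds per elementary operation.
    public static double NsPerElop(TimeSpan elapsed, int loop, int N)
    {
        return elapsed.TotalMilliseconds * 1e6 / ((double)loop * N);
    }

    // Times `loop` repetitions of an action that performs N elementary operations.
    public static double Measure(Action action, int loop, int N)
    {
        action(); // warm-up call so that JIT compilation is not timed
        var w = Stopwatch.StartNew();
        for (int k = 0; k < loop; k++) action();
        w.Stop();
        return NsPerElop(w.Elapsed, loop, N);
    }
}
```

For example, a total of 1 s for loop = 10000 and N = 10000 corresponds to 10 ns per elementary operation.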
Results for Mono on Mac under Yosemite (MacBook, Core i7)
AOS (naive loop) = 6,46 ns/elop
AOS (mathnet managed) = 24,31 ns/elop
AOS (mathnet mkl) = 32,32 ns/elop
SOA (naive loop) = 4,9 ns/elop
SOA (mathnet managed) = 4,26 ns/elop
SOA (mathnet mkl) = 21,36 ns/elop
Results for Windows 7 via VMware (MacBook, Core i7)
AOS (naive loop) = 7,19 ns/elop
AOS (mathnet managed) = 19,2 ns/elop
AOS (mathnet mkl) = 14,49 ns/elop
SOA (naive loop) = 4,12 ns/elop
SOA (mathnet managed) = 4,92 ns/elop
SOA (mathnet mkl) = 0,52 ns/elop
Do you know why there's such a huge difference (about x40) between Mono on Mac and Windows for the basic pointwise multiplication with MKL?
Is this inherent to P/Invoke under Mono on Mac? Or is it caused by how I built the MKL provider on OSX?
Thanks, Lionel
Here's my code for the pointwise multiplication of 2 vectors:
using System;
using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics;
using System.Diagnostics;

namespace TestConsoleNummerics
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            //test_Matrix(1000);
            int N = 10000;
            //test_Vmul(N);
            test_Vmul_AOSvsOAS(N);
            Console.Read();
        }

        static void test_Vmul_AOSvsOAS(int N)
        {
            // Problem definition
            int loop = 10000;
            var w = Stopwatch.StartNew();
            Random rnd = new Random();
            int n = 10;
            int ne = N / n;

            // AOS : naive loop
            var aos_x = new double[ne][];
            var aos_y = new double[ne][];
            for (int i = 0; i < aos_x.Length; i++)
            {
                aos_x[i] = new double[n];
                aos_y[i] = new double[n];
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    aos_x[i][j] = rnd.NextDouble();
                }
            }
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    for (int j = 0; j < aos_x[i].Length; j++)
                    {
                        aos_y[i][j] = aos_x[i][j] * aos_x[i][j];
                    }
                }
            }
            Console.WriteLine("AOS (naive loop) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // AOS : Math.NET managed & MKL
            var aos_mathnet_x = new Vector<double>[ne];
            var aos_mathnet_y = new Vector<double>[ne];
            for (int i = 0; i < aos_x.Length; i++)
            {
                aos_mathnet_x[i] = Vector<double>.Build.Dense(n);
                aos_mathnet_y[i] = Vector<double>.Build.Dense(n);
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    aos_mathnet_x[i][j] = aos_x[i][j];
                }
            }
            Control.UseManaged();
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    aos_mathnet_x[i].PointwiseMultiply(aos_mathnet_x[i], aos_mathnet_y[i]);
                }
            }
            Console.WriteLine("AOS (mathnet managed) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");
            Control.UseNativeMKL();
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < aos_x.Length; i++)
                {
                    aos_mathnet_x[i].PointwiseMultiply(aos_mathnet_x[i], aos_mathnet_y[i]);
                }
            }
            Console.WriteLine("AOS (mathnet mkl) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // SOA : naive loop
            var soa_x = new double[N];
            var soa_y = new double[N];
            for (int i = 0; i < aos_x.Length; i++)
            {
                for (int j = 0; j < aos_x[i].Length; j++)
                {
                    soa_x[n * i + j] = aos_x[i][j];
                }
            }
            w.Restart();
            for (int k = 0; k < loop; k++)
            {
                for (int i = 0; i < soa_x.Length; i++)
                {
                    soa_y[i] = soa_x[i] * soa_x[i];
                }
            }
            Console.WriteLine("SOA (naive loop) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");

            // SOA : Math.NET managed & MKL
            var soa_mathnet_x = Vector<double>.Build.Random(N);
            var soa_mathnet_y = Vector<double>.Build.Dense(N);
            for (int i = 0; i < soa_x.Length; i++)
            {
                soa_mathnet_x[i] = soa_x[i];
            }
            Control.UseManaged();
            w.Restart();
            for (int i = 0; i < loop; i++)
            {
                soa_mathnet_x.PointwiseMultiply(soa_mathnet_x, soa_mathnet_y);
            }
            Console.WriteLine("SOA (mathnet managed) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");
            Control.UseNativeMKL();
            w.Restart();
            for (int i = 0; i < loop; i++)
            {
                soa_mathnet_x.PointwiseMultiply(soa_mathnet_x, soa_mathnet_y);
            }
            Console.WriteLine("SOA (mathnet mkl) = " + w.ElapsedMilliseconds * 1e6 / (loop * N) + " ns/elop");
        }
    }
}
Perhaps multi-threading isn't being used for MKL on Mac? Might want to check how many threads are being set and also make sure the MKL wrapper is linking to the correct version of MKL when built on Mac.
Only thing I can think of.
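One cheap first check, as a rough sketch: print the logical processor count and the thread-count environment variables that MKL honours (MKL_NUM_THREADS and OMP_NUM_THREADS). The class name ThreadEnvSketch is made up. Note that this only shows what the environment requests, not how many threads MKL actually ends up using, and whether threading explains the gap here is an open question.

```csharp
using System;

static class ThreadEnvSketch
{
    // Returns "NAME = value", or "NAME = (not set)" when the variable is absent or empty.
    public static string Describe(string name)
    {
        var value = Environment.GetEnvironmentVariable(name);
        return name + " = " + (string.IsNullOrEmpty(value) ? "(not set)" : value);
    }

    // Prints the processor count and the thread-related environment variables.
    public static void Print()
    {
        Console.WriteLine("Logical processors = " + Environment.ProcessorCount);
        Console.WriteLine(Describe("MKL_NUM_THREADS"));
        Console.WriteLine(Describe("OMP_NUM_THREADS"));
    }
}
```

Calling ThreadEnvSketch.Print() at the start of Main on both machines would show whether the two environments are configured differently.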
If it is a P/Invoke problem, then you could try putting your loop into C/C++ and calling the MKL wrapper from there. Then there'd only be a single P/Invoke call.
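A managed-side variant of the same idea (not the C/C++ approach suggested above, but one with the same aim of fewer calls) is to pack the small vectors into one contiguous buffer and make a single call over all elements. The sketch below uses a hypothetical delegate, Multiply, as a stand-in for the native call and counts invocations; it does not call Math.NET or MKL.

```csharp
using System;

static class BatchingSketch
{
    // Hypothetical stand-in for one native pointwise-multiply call: z = x * y.
    public delegate void Multiply(double[] x, double[] y, double[] z);

    public static int Calls; // counts how often the stand-in is invoked

    public static void ManagedMultiply(double[] x, double[] y, double[] z)
    {
        Calls++;
        for (int i = 0; i < x.Length; i++) z[i] = x[i] * y[i];
    }

    // One call per small vector: x.Length calls in total.
    public static void PerVector(double[][] x, double[][] y, Multiply mul)
    {
        for (int i = 0; i < x.Length; i++) mul(x[i], x[i], y[i]);
    }

    // Pack into one contiguous buffer, make a single call, then unpack.
    public static void Batched(double[][] x, double[][] y, int n, Multiply mul)
    {
        var flatX = new double[x.Length * n];
        var flatY = new double[x.Length * n];
        for (int i = 0; i < x.Length; i++) Array.Copy(x[i], 0, flatX, n * i, n);
        mul(flatX, flatX, flatY);
        for (int i = 0; i < y.Length; i++) Array.Copy(flatY, n * i, y[i], 0, n);
    }
}
```

The packing and unpacking copies have their own cost, so keeping the data in the flat layout from the start avoids them entirely.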