very slow diagonalization with openblas-ohpc in 1.3.1
Hi,
I'm a newbie using OpenHPC for the first time, and I'm really grateful for this project since it has saved me a lot of time.
However, it turns out that matrix diagonalization (dsyev) with the OpenBLAS shipped in 1.3.1 is significantly slower than my own single-threaded build with the Sandy Bridge kernel, installed in my home directory.
What options are used when building the openblas-gnu7-ohpc RPM package?
It even hangs sometimes when running my simulation code, which involves a huge number of diagonalizations, so it is not practically usable for me. The speed difference is already noticeable with the simple test code attached below, compiled with "-march=native -O2". In a few runs on my frontend node (Xeon E5-2620 v2), the test code was on average about 4 times slower with openblas-gnu7-ohpc.
```cpp
#include <iostream>
#include <cstdlib>   // srand48, drand48
#include <cassert>
#include <ctime>     // time

extern "C" {
void dsyev_(const char* jobz, const char* uplo, int* n,
            double* a, int* lda,
            double* w, double* work, int* lwork, int* info);
}

// Query the optimal workspace size, then run the actual diagonalization.
void syev(const char jobz, const char uplo, int n, double* a, double* w)
{
    int lwork = -1, info;
    double tmp;
    dsyev_(&jobz, &uplo, &n, a, &n, w, &tmp, &lwork, &info);  // workspace query
    assert(info == 0);
    lwork = (int)(tmp + 0.1);
    double* work = new double[lwork];
    dsyev_(&jobz, &uplo, &n, a, &n, w, work, &lwork, &info);
    delete[] work;
    assert(info == 0);
}

int main()
{
    srand48(time(NULL));
    for (int k = 0; k < 100; ++k) {
        // Random symmetric matrix of size 200-400 with a shifted diagonal.
        const int N = 200 + (int)(200.0 * drand48());
        double* mat = new double[N * N];
        double* eig = new double[N];
        for (int j = 0; j < N; ++j) {
            for (int i = j; i < N; ++i) {
                mat[i + j * N] = drand48();
                mat[j + i * N] = mat[i + j * N];
            }
            mat[j + j * N] += 2.0;
        }
        syev('V', 'U', N, mat, eig);
        std::cout << N << "\t" << eig[0] << std::endl;
        delete[] eig;
        delete[] mat;
    }
}
```
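To see what the package was built with from the library itself, I suppose one can query OpenBLAS at runtime. Here is a minimal sketch, assuming the packaged library is a recent enough OpenBLAS to export openblas_get_config, openblas_get_corename, and openblas_set_num_threads (older builds may lack some of these):

```cpp
#include <iostream>

// These symbols are exported by OpenBLAS itself; declaring them directly
// avoids depending on a cblas.h header being installed.
extern "C" {
char* openblas_get_config();    // build options the library was compiled with
char* openblas_get_corename();  // kernel selected at runtime (DYNAMIC_ARCH builds)
void openblas_set_num_threads(int num_threads);
}

int main()
{
    std::cout << "config: " << openblas_get_config() << "\n";
    std::cout << "kernel: " << openblas_get_corename() << "\n";
    // Force single-threaded BLAS. For matrices this small, threading
    // overhead (or oversubscription under MPI/OpenMP) can dominate, and
    // could plausibly explain the hangs as well.
    openblas_set_num_threads(1);
    return 0;
}
```

Setting OPENBLAS_NUM_THREADS=1 in the environment should have the same effect as the openblas_set_num_threads call, and might be worth trying given the hangs.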
Hi @dhkim1231 - thanks for the report. We don't have anything conclusive yet, but we tried your code on a few different builds and got the following timings (on Haswell):
| Version | Build | Flags | Wallclock Time |
|---|---|---|---|
| 0.2.19 | Source | -march=native -O2 | 4.873s |
| 0.2.19 | OpenHPC 1.3.1 | TARGET_ARCH=dynamic -O2 | 19.814s |
| 0.2.19 | SLE 12 | TARGET_ARCH=dynamic -O2 | 21.738s |
| 0.2.20 | Source | -march=native -O2 | 4.818s |
| 0.2.20 | OpenHPC 1.3.2 | TARGET_ARCH=dynamic -O2 | 5.151s |
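If it helps to isolate the LAPACK time from program startup, here's a small self-contained sketch that times a single dsyev call in-process with std::chrono. It uses the same dsyev_ interface as your test code; the fixed size n=300 is just an arbitrary choice for illustration:

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

extern "C" {
void dsyev_(const char* jobz, const char* uplo, int* n,
            double* a, int* lda,
            double* w, double* work, int* lwork, int* info);
}

int main()
{
    int n = 300, lda = n, info = 0, lwork = -1;
    std::vector<double> a(n * n), w(n);

    // Random symmetric matrix with a shifted diagonal, as in the test above.
    srand48(42);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i)
            a[i + j * n] = a[j + i * n] = drand48();
        a[j + j * n] += 2.0;
    }

    // Workspace query, then allocate.
    double tmp;
    dsyev_("V", "U", &n, a.data(), &lda, w.data(), &tmp, &lwork, &info);
    lwork = (int)(tmp + 0.1);
    std::vector<double> work(lwork);

    // Time only the diagonalization itself.
    auto t0 = std::chrono::steady_clock::now();
    dsyev_("V", "U", &n, a.data(), &lda, w.data(), work.data(), &lwork, &info);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "dsyev n=" << n << ": "
              << std::chrono::duration<double>(t1 - t0).count()
              << " s (info=" << info << ")\n";
}
```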
Are these more or less in line with what you are seeing?
Yes, the ratio is more or less what I saw with OpenHPC 1.3.1 on my Sandy Bridge machine. It's interesting that there's such a dramatic difference between 1.3.1 and 1.3.2.