
Speedup QR decomposition

Open t4c1 opened this issue 6 years ago • 10 comments

Description

Eigen's QR decomposition can be improved upon with better parameter tuning. GPUs can be used for further speedup.

Example

QR decomposition is faster.

Expected Output

QR decomposition is faster.

Current Version:

v2.19.1

t4c1 avatar May 07 '19 07:05 t4c1

I have implemented three versions of the algorithm. The first uses the CPU, the second the GPU, and the last is a hybrid version that uses the GPU only for larger matrix products.
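For readers who want to see what is being tuned: all of these versions are built around a Householder QR, and the block-size parameter governs how many reflector applications get batched into larger matrix products. Below is a minimal *unblocked* sketch in plain C++ for illustration only; it is not the branch code, which works on Eigen types and blocks this loop.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix: m rows, each a vector of n doubles.
using Matrix = std::vector<std::vector<double>>;

// Unblocked Householder QR for an m x n matrix with m >= n.
// On return, the upper triangle of A holds R and Q is the explicit
// m x m orthogonal factor, so A_original = Q * R.
void householder_qr(Matrix& A, Matrix& Q) {
  const std::size_t m = A.size(), n = A[0].size();
  Q.assign(m, std::vector<double>(m, 0.0));
  for (std::size_t i = 0; i < m; ++i) Q[i][i] = 1.0;

  for (std::size_t k = 0; k < n; ++k) {
    // Build the Householder vector v annihilating A[k+1..m-1][k].
    double norm = 0.0;
    for (std::size_t i = k; i < m; ++i) norm += A[i][k] * A[i][k];
    norm = std::sqrt(norm);
    if (norm == 0.0) continue;
    const double alpha = A[k][k] > 0.0 ? -norm : norm;
    std::vector<double> v(m, 0.0);
    for (std::size_t i = k; i < m; ++i) v[i] = A[i][k];
    v[k] -= alpha;
    double vtv = 0.0;
    for (std::size_t i = k; i < m; ++i) vtv += v[i] * v[i];

    // Apply H = I - 2 v v^T / (v^T v) from the left to A ...
    for (std::size_t j = k; j < n; ++j) {
      double dot = 0.0;
      for (std::size_t i = k; i < m; ++i) dot += v[i] * A[i][j];
      const double s = 2.0 * dot / vtv;
      for (std::size_t i = k; i < m; ++i) A[i][j] -= s * v[i];
    }
    // ... and accumulate Q := Q * H.
    for (std::size_t r = 0; r < m; ++r) {
      double dot = 0.0;
      for (std::size_t i = k; i < m; ++i) dot += Q[r][i] * v[i];
      const double s = 2.0 * dot / vtv;
      for (std::size_t i = k; i < m; ++i) Q[r][i] -= s * v[i];
    }
  }
}
```

A blocked variant factors panels of `b` columns at a time and applies the accumulated reflectors to the trailing matrix as one matrix product, which is exactly the part that is profitable to offload to a GPU.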

Here are some graphs showing speedups relative to the implementation from Eigen. Even my CPU version is faster, despite being the same algorithm. I guess mine has a better-tuned parameter (block size) for my CPU. Measurements were done on a Core i5 2500 and a GTX 1070.

[Graphs: qr_speedup_const_cols_mkl, qr_speedup_const_rows_mkl, qr_speedup_square_mkl]

The question is whether we want all three in Stan Math. The CPU and hybrid versions are relatively simple implementations, while the full GPU version needs four new kernels.
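The dispatch logic of the hybrid version can be sketched as below. Both the cutoff value and the work estimate are illustrative assumptions of mine, not values taken from the branch, which would tune them per device.

```cpp
#include <cstddef>

enum class Backend { Cpu, Gpu };

// Illustrative cutoff: below this much work, host<->device transfer overhead
// tends to dominate any GPU speedup. The value 1000000 is a made-up
// placeholder, not a measured constant.
constexpr std::size_t kGpuWorkCutoff = 1000000;

// Decide where to run the trailing-matrix update of one blocked QR step,
// which multiplies an (m x b) panel factor into the m x (n - b) trailing
// block of the matrix.
Backend choose_backend(std::size_t m, std::size_t n, std::size_t b) {
  const std::size_t work = m * (n - b) * b;  // rough proxy for product size
  return work >= kGpuWorkCutoff ? Backend::Gpu : Backend::Cpu;
}
```

With a scheme like this, small factorizations never pay the transfer cost, while large ones spend most of their time in GPU matrix products.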

t4c1 avatar May 07 '19 07:05 t4c1

In case you have an Intel CPU: could you compile things with Intel MKL as a backend for Eigen? As far as I am aware, Eigen can use it, and it should make a difference for Intel CPUs at least.
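For reference, switching Eigen to the MKL backend is mostly a build-configuration change. This sketch assumes MKL is installed and `MKLROOT` is set; see Eigen's "Using Intel MKL from Eigen" documentation for the exact link line for your MKL version.

```cpp
// Must be defined before any Eigen header so Eigen substitutes MKL's
// BLAS/LAPACK routines for its own kernels where possible.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(2000, 2000);
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);  // now MKL-backed
  return 0;
}
```

Compile with something like `-I${MKLROOT}/include -lmkl_rt` (the single-dynamic-library link; exact flags depend on the MKL version), and switch between sequential and parallel MKL at runtime, e.g. by setting `MKL_NUM_THREADS=1`.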

wds15 avatar May 07 '19 07:05 wds15

I will try.

t4c1 avatar May 07 '19 07:05 t4c1

Thanks for trying... I know this is a lot of work... but maybe MKL is a great option as well. As we head in the direction of these optimisations, it is good to have an overview.

wds15 avatar May 07 '19 07:05 wds15

I have used Eigen with MKL before, so I have everything already set up. Tests are already running.

t4c1 avatar May 07 '19 07:05 t4c1

I have updated the graphs with MKL results.

t4c1 avatar May 07 '19 09:05 t4c1

So Eigen with parallel MKL already gives some speedup... but this seems like an unfair comparison, in that your CPU version is single-core, right?

How many cores were used for parallel?

In any case, it seems that your proposed variant speeds up things.

wds15 avatar May 07 '19 09:05 wds15

Actually, sequential MKL speeds things up by around 20%, but that is hard to see at the scale of the graphs.

I don't think the comparison is unfair, but speedups from running multiple chains in parallel are probably better anyway.

Parallel MKL used all 4 cores.

t4c1 avatar May 07 '19 09:05 t4c1

@t4c1 is there a branch with these versions?

spinkney avatar Mar 24 '24 18:03 spinkney

Huh, that is old. I found the branch here: https://github.com/bstatcomp/math/tree/gpu_qr

t4c1 avatar Mar 25 '24 18:03 t4c1