Speed up QR decomposition
Description
Eigen's QR decomposition can be improved upon with better parameter tuning. GPUs can be used for further speedup.
Example
QR decomposition is faster.
Expected Output
QR decomposition is faster.
Current Version:
v2.19.1
I have implemented three versions of the algorithm. The first uses the CPU, the second the GPU, and the last is a hybrid version that uses the GPU only for larger matrix products.
Here are some graphs showing speedups relative to the implementation from Eigen. Even my CPU version is faster, despite being the same algorithm; I guess mine has a better-tuned parameter (block size) for my CPU. Measurements were done on a Core i5 2500 and a GTX 1070.

The question is whether we want all three in Stan Math. The CPU and hybrid versions are relatively simple implementations, while the full GPU version needs four new kernels.
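For reference, here is a minimal sketch of the blocked Householder QR idea (my reconstruction of the general shape, not the code from the branch): the block size is the tunable parameter mentioned above, and the trailing-matrix update is the large product that the hybrid version hands to the GPU.

```cpp
// Sketch of a right-looking blocked Householder QR (illustrative only, not
// the branch implementation). block_size is the tunable parameter; the
// trailing update is the large matrix product a hybrid CPU/GPU version
// would offload to the GPU once the blocks are big enough.
#include <Eigen/Dense>
#include <algorithm>

// Overwrites A so that its upper triangle holds R; Q is not formed here.
void blocked_qr_inplace(Eigen::MatrixXd& A, Eigen::Index block_size = 64) {
  const Eigen::Index m = A.rows(), n = A.cols();
  const Eigen::Index kmax = std::min(m, n);
  for (Eigen::Index k = 0; k < kmax; k += block_size) {
    const Eigen::Index b = std::min(block_size, kmax - k);
    // Unblocked QR of the current panel A(k:m, k:k+b).
    Eigen::HouseholderQR<Eigen::MatrixXd> panel(A.block(k, k, m - k, b));
    // Trailing update: apply the panel's Q^T to the remaining columns. In
    // compact-WY form this becomes a pair of large matrix products, which is
    // where a GPU (or a well-chosen block size on the CPU) pays off.
    if (k + b < n) {
      A.block(k, k + b, m - k, n - k - b) =
          panel.householderQ().adjoint() * A.block(k, k + b, m - k, n - k - b);
    }
    // Store the panel's R factor (Householder vectors stay below the diagonal).
    A.block(k, k, m - k, b) = panel.matrixQR();
  }
}
```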
In case you have an Intel CPU, could you compile things with the Intel MKL as a backend for Eigen? As far as I am aware, Eigen can use that, and it should make a difference for Intel CPUs at least.
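For context, Eigen's MKL backend is enabled by defining `EIGEN_USE_MKL_ALL` before including any Eigen header and linking against MKL. A minimal sketch (the exact link line depends on the MKL installation):

```cpp
// Minimal sketch of using MKL as Eigen's backend (assumes MKL is installed
// and its include/lib directories are on the search paths). The define must
// come before any Eigen header; supported decompositions, including
// Householder QR, are then routed to MKL's LAPACK routines.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(2000, 2000);
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);  // QR factorization via MKL
  Eigen::MatrixXd R = qr.matrixQR().triangularView<Eigen::Upper>();
  std::cout << "R(0,0) = " << R(0, 0) << "\n";
  return 0;
}
// Example build line (adjust to your MKL install):
//   g++ -O3 qr_mkl.cpp -I"$MKLROOT/include" -L"$MKLROOT/lib/intel64" -lmkl_rt -o qr_mkl
```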
I will try.
Thanks for trying... I know this is a lot of work... but maybe the MKL is a great option as well. As we head in the direction of these optimisations, it is good to have an overview.
I have used Eigen with MKL before, so I have everything already set up. Tests are already running.
I have updated the graphs with MKL results.
So Eigen with parallel MKL already gives some speedup... but this seems like an unfair comparison in that your CPU version is single-core, right?
How many cores were used for the parallel runs?
In any case, it seems that your proposed variant speeds things up.
Actually, sequential MKL speeds things up by around 20%, but that is hard to see on the scale of the graphs.
I don't think the comparison is unfair, but the speedups from running multiple chains in parallel are probably better anyway.
Parallel MKL used all 4 cores.
@t4c1 is there a branch with these versions?
Huh, that is old. I found the branch here: https://github.com/bstatcomp/math/tree/gpu_qr