Speed up QR decomposition
Description
Eigen's QR decomposition can be improved upon with better parameter tuning. GPUs can be used for further speedup.
Example
QR decomposition is faster.
Expected Output
QR decomposition is faster.
Current Version:
v2.19.1
I have implemented three versions of the algorithm. The first uses the CPU, the second the GPU, and the last is a hybrid version that uses the GPU only for larger matrix products.
Here are some graphs showing speedups relative to the implementation from Eigen. Even my CPU version is faster, despite being the same algorithm; I guess mine has a better-tuned parameter (block size) for my CPU. Measurements were done on a Core i5 2500 and a GTX 1070.

The question is whether we want all three in Stan Math. The CPU and hybrid versions are relatively simple implementations, while the full GPU version needs four new kernels.
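For reference, here is a minimal sketch of the blocked Householder QR idea (my reconstruction of the general shape, not the code from the branch): the block size is the tunable parameter mentioned above, and the trailing-matrix update is the large product that the hybrid version hands to the GPU.

```cpp
// Sketch of a right-looking blocked Householder QR (illustrative only, not
// the branch implementation). block_size is the tunable parameter; the
// trailing update is the large matrix product a hybrid CPU/GPU version
// would offload to the GPU once the blocks are big enough.
#include <Eigen/Dense>
#include <algorithm>

// Overwrites A so that its upper triangle holds R; Q is not formed here.
void blocked_qr_inplace(Eigen::MatrixXd& A, Eigen::Index block_size = 64) {
  const Eigen::Index m = A.rows(), n = A.cols();
  const Eigen::Index kmax = std::min(m, n);
  for (Eigen::Index k = 0; k < kmax; k += block_size) {
    const Eigen::Index b = std::min(block_size, kmax - k);
    // Unblocked QR of the current panel A(k:m, k:k+b).
    Eigen::HouseholderQR<Eigen::MatrixXd> panel(A.block(k, k, m - k, b));
    // Trailing update: apply the panel's Q^T to the remaining columns. In
    // compact-WY form this becomes a pair of large matrix products, which is
    // where a GPU (or a well-chosen block size on the CPU) pays off.
    if (k + b < n) {
      A.block(k, k + b, m - k, n - k - b) =
          panel.householderQ().adjoint() * A.block(k, k + b, m - k, n - k - b);
    }
    // Store the panel's R factor (Householder vectors stay below the diagonal).
    A.block(k, k, m - k, b) = panel.matrixQR();
  }
}
```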
In case you have an Intel CPU, could you compile things with the Intel MKL as a backend for Eigen? As far as I am aware, Eigen can use that, and it should make a difference for Intel CPUs at least.
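For context, Eigen's MKL backend is enabled by defining `EIGEN_USE_MKL_ALL` before including any Eigen header and linking against MKL. A minimal sketch (the exact link line depends on the MKL installation):

```cpp
// Minimal sketch of using MKL as Eigen's backend (assumes MKL is installed
// and its include/lib directories are on the search paths). The define must
// come before any Eigen header; supported decompositions, including
// Householder QR, are then routed to MKL's LAPACK routines.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(2000, 2000);
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);  // QR factorization via MKL
  Eigen::MatrixXd R = qr.matrixQR().triangularView<Eigen::Upper>();
  std::cout << "R(0,0) = " << R(0, 0) << "\n";
  return 0;
}
// Example build line (adjust to your MKL install):
//   g++ -O3 qr_mkl.cpp -I"$MKLROOT/include" -L"$MKLROOT/lib/intel64" -lmkl_rt -o qr_mkl
```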
I will try.
Thanks for trying... I know this is a lot of work... but maybe the MKL is a great option as well. As we head in the direction of these optimisations, it is good to have an overview.
I have used Eigen with MKL before, so I have everything already set up. Tests are already running.
I have updated the graphs with MKL results.
So Eigen with parallel MKL already gives some speedup... but this seems like an unfair comparison in that your CPU version is single-core, right?
How many cores were used for the parallel runs?
In any case, it seems that your proposed variant speeds things up.
Actually, sequential MKL speeds things up by around 20%, but that is hard to see on the scale of the graphs.
I don't think the comparison is unfair, but the speedups from running multiple chains in parallel are probably better anyway.
Parallel MKL used all 4 cores.
@t4c1 is there a branch with these versions?
Huh, that is old. I found the branch here: https://github.com/bstatcomp/math/tree/gpu_qr