scikit-learn-intelex icon indicating copy to clipboard operation
scikit-learn-intelex copied to clipboard

[QST] Methods + ideas behind further speeding up ML algorithms - Linear Regression, PCA, KNN

Open danielhanchen opened this issue 3 years ago • 0 comments
trafficstars

Hey there! First, fabulous work on speeding up Sklearn! I'm the author of Hyperlearn (https://github.com/danielhanchen/hyperlearn), where I also tried investigating how to speed up algos on the CPU (all the way from SSE4 to AVX512) and GPUs (was previously at NVIDIA's RAPIDS team - mainly responsible for 2000x faster TSNE, 50% less memory for SVD, and was working on sparse randomized svd, fast NMF etc)

I was trying to understand the main methods behind some of the algorithms.

  1. Linear regression Tall skinny matrices can be dramatically sped up using the normal equations inv(X.T @ X) @ (X.T @ y). Sklearn sadly just uses the slow SVD approach gelsd. Sklearn guarantees stability though. During my investigation, approximately fast linear regression for tall skinny matrices is:
XTX = X.T @ X = ssyrk(X)
XTy = X.T @ y = sgemm(X.T, y) / sgemv(X.T, y)
U = spotrf(XTX)
z = U@b = strtrs(U.T, Xty) [U.T@(U@b) = XTy]
b = strtrs(U, z) [U@b = z]

Cholesky can cause issues, so Ridge Regularization can help. This approach is probably the least amount of FLOPs possible to find a solution. Some workspace memory can be reduced as well.

However, in my findings, pivoted cholesky (spstrf) can make the solution in fact solve all solutions, even ill conditioned ones, albeit with a higher condition number. The pivoting is much more tricky to handle.

Is this approximately what Intel's version is using to speedup linear regression?

  1. PCA PCA requires (X - mean(X)) first. However, we show surprisingly that rather if you use syevd / syevr to find the eigenvectors, you can in fact bypass copying the matrix X for mean removal. You can use a smart matrix trick after expanding the formula to save O(NP) memory. [N = rows, P = columns]

  2. PCA Scipy's syed / syevr comparison funnily was one my old hyperlearn results. https://github.com/scipy/scipy/issues/9212 A long time agao I mentioned how large matrices must use syevd for superior performance, and syevr is good for smaller matrices. We found a heuristic of selecting which to use (syevr / syevd).

  3. KNN. Nearest neighbors is actually quite interesting to improve on CPUs. Pairwise distances can easily be made faster directly for Euclidean and Cosine similarity / distance using some matrix algebra. Both require the computation of X @ X.T, which you can use again ssyrk again.

  4. Randomized SVD. Not sure if Intel's package has dramatically sped this one up. We spent a lot of time into the weeds. We found by replacing the range finder from QR (we can even toggle the finding of R off) to stable LU decomposition then using stable cholesky + some other sneaking tricks, huge speedups was possible with few loss in accuracy. Likewise Randomized PCA on sparse data was 100% possible using our special PCA trick I said.

Anyways I was just curious on the details behind the algorithms Intel has devised :)) All my code is available for use - though now I'm running a startup Moonshot (www.moonshotai.org) were our goal is to make all software faster!

Thanks for your time!

danielhanchen avatar May 23 '22 16:05 danielhanchen