scikit-learn-intelex
[QST] Methods + ideas behind further speeding up ML algorithms - Linear Regression, PCA, KNN
Hey there! First, fabulous work on speeding up Sklearn! I'm the author of Hyperlearn (https://github.com/danielhanchen/hyperlearn), where I also investigated how to speed up ML algorithms on CPUs (all the way from SSE4 to AVX512) and GPUs (I was previously on NVIDIA's RAPIDS team, mainly responsible for the 2000x faster TSNE, 50% less memory for SVD, and work on sparse randomized SVD, fast NMF, etc.).
I was trying to understand the main methods behind some of the algorithms.
- Linear regression: Tall skinny matrices can be dramatically sped up using the normal equations, inv(X.T @ X) @ (X.T @ y). Sklearn sadly just uses the slow SVD-based gelsd routine, although that does guarantee stability. From my investigation, an approximate recipe for fast linear regression on tall skinny matrices is:
XTX = X.T @ X = ssyrk(X)
XTy = X.T @ y = sgemm(X.T, y) / sgemv(X.T, y)
U = spotrf(XTX)        [Cholesky: XTX = U.T @ U]
z = strtrs(U.T, XTy)   [solves U.T @ z = XTy, where z = U @ b]
b = strtrs(U, z)       [solves U @ b = z]
Cholesky can run into numerical issues, so Ridge regularization helps. This approach uses probably the fewest FLOPs possible to reach a solution, and some workspace memory can be saved as well.
However, in my findings, pivoted Cholesky (spstrf) can in fact handle all cases, even ill-conditioned ones, albeit with a higher condition number; the pivoting is much trickier to handle.
Is this approximately what Intel's version is using to speed up linear regression? (A rough SciPy sketch of the recipe above follows this list.)
- PCA: PCA requires computing (X - mean(X)) first. However, we found, surprisingly, that if you use syevd / syevr to find the eigenvectors, you can bypass copying the matrix X for mean removal entirely: a smart matrix identity, obtained by expanding the formula, saves O(NP) memory [N = rows, P = columns]. (Sketched after this list.)
- PCA: SciPy's syevd / syevr comparison was, funnily enough, one of my old Hyperlearn results: https://github.com/scipy/scipy/issues/9212. A long time ago I mentioned that large matrices should use syevd for superior performance, while syevr is good for smaller matrices. We found a heuristic for selecting which one to use (syevr / syevd); see the sketch after this list.
- KNN: Nearest neighbors is actually quite interesting to improve on CPUs. Pairwise distances can easily be made faster for both Euclidean and cosine similarity / distance using some matrix algebra: both reduce to computing X @ X.T, where ssyrk can be used once again (sketched below).
- Randomized SVD: Not sure if Intel's package has dramatically sped this one up. We spent a lot of time in the weeds here. We found that by replacing the QR range finder (you can even toggle off computing R) with a stable LU decomposition, then using a stable Cholesky plus some other sneaky tricks, huge speedups were possible with little loss in accuracy (rough sketch after this list). Likewise, randomized PCA on sparse data is 100% possible using the special PCA trick I mentioned above.
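To make the linear regression recipe concrete, here is a rough NumPy/SciPy sketch of the Cholesky normal-equations path (spotrf / strtrs map onto scipy.linalg.cholesky and solve_triangular). The ridge default is purely illustrative, added only to keep the factorization stable:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_lstsq(X, y, ridge=1e-8):
    """Solve min ||X @ b - y||^2 via the normal equations + Cholesky."""
    XTX = X.T @ X                              # syrk-style Gram matrix
    XTy = X.T @ y                              # gemv / gemm
    XTX[np.diag_indices_from(XTX)] += ridge    # small ridge for stability (illustrative default)
    U = cholesky(XTX, lower=False)             # potrf: XTX = U.T @ U
    z = solve_triangular(U, XTy, trans='T')    # trtrs: solve U.T @ z = XTy
    return solve_triangular(U, z)              # trtrs: solve U @ b = z

# Quick check against the SVD-based (gelsd) solver sklearn relies on:
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 20))         # tall and skinny
y = rng.standard_normal(100_000)
b = cholesky_lstsq(X, y)
b_ref = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(b, b_ref, atol=1e-6)
```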
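For the PCA bullet, a rough sketch of the memory-saving idea: the scatter matrix of the centered data can be assembled from X.T @ X and the column means alone, since (X - mu).T @ (X - mu) = X.T @ X - n * outer(mu, mu), so the O(NP) centered copy never has to be materialized. The details differ from my actual Hyperlearn code, so treat this as illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def pca_no_centering_copy(X, n_components):
    n = X.shape[0]
    mu = X.mean(axis=0)
    # (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu), so no centered copy of X is needed.
    scatter = X.T @ X - n * np.outer(mu, mu)
    evals, evecs = eigh(scatter)               # LAPACK syevd / syevr; eigenvalues ascending
    order = np.argsort(evals)[::-1][:n_components]
    components = evecs[:, order].T
    explained_variance = evals[order] / (n - 1)
    return components, explained_variance

# Sanity check against the naive centered version:
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
_, var = pca_no_centering_copy(X, 3)
Xc = X - X.mean(axis=0)
evals_ref = np.linalg.eigvalsh(Xc.T @ Xc)      # ascending
assert np.allclose(np.sort(var), evals_ref[-3:] / (X.shape[0] - 1))
```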
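For the syevd / syevr bullet, the size-based selection can be expressed through SciPy's driver argument to eigh (available since SciPy 1.5); the cutoff below is only a placeholder, not the heuristic we actually tuned:

```python
import numpy as np
from scipy.linalg import eigh

def eigh_auto(S, size_cutoff=2000):
    """Symmetric eigendecomposition with a size-based LAPACK driver choice."""
    # Placeholder heuristic: syevd for large matrices, syevr for smaller ones.
    driver = "evd" if S.shape[0] >= size_cutoff else "evr"
    return eigh(S, driver=driver)

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
S = A @ A.T                                    # symmetric PSD test matrix
w, v = eigh_auto(S)                            # picks syevr here since 300 < 2000
```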
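For the KNN bullet, a small sketch of how both squared Euclidean distances and cosine similarity fall out of the Gram matrix X @ X.T (an ssyrk-shaped product) plus cheap row-norm corrections:

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_sq_euclidean(X):
    G = X @ X.T                                # Gram matrix (syrk-shaped)
    sq_norms = np.diag(G)                      # ||x_i||^2 sits on the diagonal
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G
    np.maximum(D, 0.0, out=D)                  # clip tiny negatives from round-off
    return D

def pairwise_cosine_similarity(X):
    G = X @ X.T
    norms = np.sqrt(np.diag(G))
    return G / np.outer(norms, norms)

# Sanity check against SciPy's reference implementation:
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
assert np.allclose(np.sqrt(pairwise_sq_euclidean(X)), cdist(X, X), atol=1e-6)
```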
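For the randomized SVD bullet, a rough sketch of the LU + Cholesky range finder idea: LU replaces QR for re-normalizing the power iterations, and a Cholesky-QR step produces the final orthonormal basis. The extra tricks in Hyperlearn are not reproduced here, so this is illustrative only:

```python
import numpy as np
from scipy.linalg import lu, cholesky, solve_triangular, svd

def randomized_svd_lu(X, n_components, n_oversample=10, n_iter=4, seed=0):
    rng = np.random.default_rng(seed)
    k = n_components + n_oversample
    Y = X @ rng.standard_normal((X.shape[1], k))       # random range sketch
    for _ in range(n_iter):
        # LU (getrf) is cheaper than QR and keeps the power iterate well scaled.
        Y, _ = lu(X @ (X.T @ Y), permute_l=True)
    # Cholesky QR: Y = Q @ R with R.T @ R = Y.T @ Y, so Q = Y @ inv(R).
    R = cholesky(Y.T @ Y, lower=False)
    Q = solve_triangular(R, Y.T, trans='T').T
    # Project onto the basis, take the small dense SVD, rotate back.
    U_small, s, Vt = svd(Q.T @ X, full_matrices=False)
    return (Q @ U_small)[:, :n_components], s[:n_components], Vt[:n_components]

# Sanity check on a nearly low-rank matrix:
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 15)) @ rng.standard_normal((15, 300))
A += 1e-3 * rng.standard_normal(A.shape)               # small full-rank noise
U, s, Vt = randomized_svd_lu(A, 10)
s_ref = np.linalg.svd(A, compute_uv=False)[:10]
assert np.allclose(s, s_ref, rtol=1e-4)
```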
Anyways, I was just curious about the details behind the algorithms Intel has devised :)) All my code is available for use, though now I'm running a startup, Moonshot (www.moonshotai.org), where our goal is to make all software faster!
Thanks for your time!