
predict() method for models?

Open · dmeliza opened this issue on Sep 14 '23 · 6 comments

The scikit-learn implementations of PLS and CCA have predict() methods that are very useful for cross-validation and forecasting. Is it possible to add these to cca-zoo models where appropriate?

dmeliza · Sep 14 '23

Pushed a version of this to main

jameschapman19 · Sep 14 '23

It works slightly differently to scikit-learn: you pass the views (with missing views optionally given as None) and it reconstructs all of the views from the learnt latent dimensions.

jameschapman19 · Sep 14 '23

Thanks! I'll check it out.

dmeliza · Sep 14 '23

This works well with my data, but only if the view data are whitened first. I'm not enough of an expert in these methods to say why this might be, but it looks like the methods for generating predictions are quite different in cca-zoo compared to sklearn's PLSRegression.

dmeliza · Sep 19 '23

If you come back to me in a week and a half I think I will be able to come up with a more detailed response and fix.

Basically your observation is exactly what I would expect and a colleague of mine has been thinking about this in some depth recently.

We learn weights W_x which transform X W_x = Z_x, and W_y which transform Y W_y = Z_y. Going from data to latent space is usually known as a backward problem.

For prediction (or 'generation') we need a forward problem.

For PLS, it turns out the forward problem is X = Z W_x^T and Y = Z W_y^T.

But for CCA the forward problem is actually X = Z W_x^T \Sigma_X and Y = Z W_y^T \Sigma_Y, where \Sigma_X and \Sigma_Y are the covariance matrices of X and Y.

The predict function I wrote up quickly for you uses the PLS forward problem (because that's what scikit-learn appears to do).

But notice that if \Sigma_X is the identity then the two forward problems are the same. \Sigma_X is the identity when your data is whitened, and that's why you are seeing what you are seeing.

Based on the above you might be able to implement a CCA prediction function without my help, and if you do get a chance, feel free to send a PR :) otherwise I'll do it when I get a moment.

jameschapman19 · Sep 19 '23

I've been digging through the code and looking at weights, scores, loadings with my data, and I'm starting to think prediction may be broken for some models in scikit-learn.

To set the context, Y is 58000 by 40 and X is 58000 by 1500. sklearn's PLSRegression works reasonably well with about 10 components; sklearn.cross_decomposition.PLSCanonical, cca_zoo.linear.PLS and cca_zoo.linear.CCA all produce horrible in-sample predictions unless I whiten the inputs. However, whitening totally destroys out-of-sample performance, so it's not an option.

For PLSRegression (i.e. PLS2), prediction works great for unwhitened data. The class computes a "rotation matrix" Pₓ that gives Zₓ = XPₓ. It's using Pₓ = Wₓ(ΓᵀWₓ)⁻¹ rather than just Wₓ as in your example above, Γᵀ being the matrix of X loadings. The prediction is then Y = XPₓΔᵀ, where Δᵀ is the matrix of loadings for Y. This works because Z_y ≈ Zₓα with α = 1: if I fit a line through the X and Y scores it has an intercept of 0 and a slope of 1.

For PLSCanonical, which I think is the same flavor of PLS as cca_zoo.linear.PLS, α is not equal to 1, and it's different for each of the components. So the predictions from the different components are not being scaled appropriately, and the overall predictions look like garbage, because the first component accounts for the lion's share of the variance. I am guessing that this α is the same as your Σ in your post above?

The reason I think there's an error in sklearn is that, according to the User Guide, this factor α needs to be inferred from the data, but I don't see anywhere in the code where that happens. This is my very naive way of trying to fix it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Per-component scale alpha: regress the Y scores on the X scores, keep the diagonal
fm = LinearRegression()
fm.fit(model._x_scores, model._y_scores)
alpha = np.diag(np.diag(fm.coef_))

# Insert alpha between the rotation matrix and the Y loadings
pred = X_test_scaled @ model.x_rotations_ @ alpha @ model.y_loadings_.T
```

It seems to work, although I'm sure there's a better way to get α than multiple regression. I haven't tried yet with CCA. If you have a more sophisticated solution I'm happy to write up a PR, and I can submit an issue to sklearn as well.

dmeliza · Sep 20 '23