Ridge Regularization in linear solve: consistency
It will be easier to discuss this on GitHub rather than in an internal Google doc.
The issue
Current state of the API:
| Function | Without ridge | With ridge r > 0 | Remark |
|---|---|---|---|
| solve_cg | Ax=b | (A+rI)x=b | well-posed because A is PSD |
| solve_gmres | Ax=b | (A+rI)x=b | ill-posed if A=-rI |
| solve_bicgstab | Ax=b | (A+rI)x=b | ill-posed if A=-rI |
| solve_normal_cg | A^TAx=A^Tb | (A^TA+rI)x=A^Tb | well posed because A^TA is PSD |
There is a consistency issue here: if solve_normal_cg followed the same convention as the other solvers (i.e. applied the normal equations to the regularized system (A+rI)x=b), it would solve (A^T+rI)(A+rI)x=(A^T+rI)b. Consequently solve_cg, solve_gmres and solve_bicgstab are all interchangeable when r > 0, but solve_normal_cg is not interchangeable with them. Worse: when r=0 all four are interchangeable with each other (at least for PD matrices), so the inconsistency only appears once regularization is turned on.
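A small numpy sketch of the inconsistency, using direct solves as a stand-in for the iterative solvers (the matrix, seed and ridge value are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T + np.eye(4)  # make A PD so every solver applies
b = rng.normal(size=4)
r = 0.5

# What solve_cg / solve_gmres / solve_bicgstab currently solve: (A + rI) x = b
x_shifted = np.linalg.solve(A + r * np.eye(4), b)

# What solve_normal_cg currently solves: (A^T A + rI) x = A^T b
x_normal = np.linalg.solve(A.T @ A + r * np.eye(4), A.T @ b)

# For r > 0 the two answers disagree; at r = 0 both reduce to A^{-1} b.
print(np.allclose(x_shifted, x_normal))  # False
```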
Discussion:
Tikhonov regularization
regularizes with A^TA+rI, just like solve_normal_cg. This guarantees a well-posed problem.
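To make the connection explicit: (A^TA+rI)x=A^Tb is exactly the normal equation of the ridge least-squares problem min_x ||Ax-b||^2 + r||x||^2. A quick numpy check (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))
b = rng.normal(size=6)
r = 0.1

# Tikhonov / ridge solution via its normal equations: (A^T A + rI) x = A^T b
x_ridge = np.linalg.solve(A.T @ A + r * np.eye(3), A.T @ b)

# Same solution from the equivalent augmented least-squares problem
#   min_x || [A; sqrt(r) I] x - [b; 0] ||^2
A_aug = np.vstack([A, np.sqrt(r) * np.eye(3)])
b_aug = np.concatenate([b, np.zeros(3)])
x_lstsq, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

print(np.allclose(x_ridge, x_lstsq))  # True
```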
Another observation: most scikit-learn solvers for Ridge regression use the A^TA+rI trick.
No one uses A+rI on a general matrix A: it only makes sense to do so on a PSD matrix.
Solution
Two solutions:
- Change (A^TA+rI)x=A^Tb into (A^T+rI)(A+rI)x=(A^T+rI)b, but in this case the problem is ill-posed for A=-rI.
- Remove ridge regularization from solve_gmres/solve_bicgstab, because the current regularization may produce a matrix with a worse condition number (unexpected behavior when we regularize a system). This happens in particular for NSD matrices.
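The conditioning concern can be seen on a tiny NSD example (the diagonal entries and r are arbitrary illustrative values):

```python
import numpy as np

r = 1.0
# A well-conditioned negative-definite matrix
A = -np.diag([0.9, 1.5, 2.0])
# The A + rI shift moves an eigenvalue close to zero, so conditioning degrades
print(np.linalg.cond(A))                  # ~2.2
print(np.linalg.cond(A + r * np.eye(3)))  # 10.0
```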
I am in favor of the second option, to remain consistent with the literature, unless we can prove that the A+rI approach makes sense.