
Rationale behind some of the bits and pieces in the C code

Open LTLA opened this issue 3 years ago • 4 comments

I've been writing a C++ port of irlba's C code (https://github.com/LTLA/CppIrlba) for use in some other applications of mine. It's been... tough going, but I think I've gotten it right, especially after cross-referencing to the Baglama and Reichel paper. Not that I really understand what's going on, but at least I know where the various steps are coming from.

That said, there are a few steps that I can't find in the paper - perhaps you can shed some light on why they exist:

https://github.com/bwlewis/irlba/blob/2ad759c142b1ef43cfdd21065e9ff4db22046256/src/irlb.c#L412-L415

Is this necessary? I understand that Gram-Schmidt orthogonalizes the (j+1)-th column against the j-th column, but it's followed by a full reorthogonalization via orthog() that does the same thing, so what's the rationale for this extra step?

https://github.com/bwlewis/irlba/blob/2ad759c142b1ef43cfdd21065e9ff4db22046256/src/irlb.c#L363

I'm guessing that the various RNORM() calls are intended to just fill W and V with some kind of content when the norm is zero; otherwise we'd end up with some nonsense (indefinite values or all-zero values). I'd further guess that the arbitrariness of the values doesn't really matter here because the associated singular values will be zero after the SVD in the main IRLBA loop.

I probably have a few other questions that will come to me as I stare at the code and paper for a while longer. Of course, any of your thoughts on the current state of the port are also appreciated.

LTLA avatar Jun 29 '21 07:06 LTLA

Hi Aaron,

This is great! I'll try to help review and test your code; I'm not great with C++ but I can at least help with testing, etc.

On 6/29/21, Aaron Lun wrote:


https://github.com/bwlewis/irlba/blob/2ad759c142b1ef43cfdd21065e9ff4db22046256/src/irlb.c#L412-L415

Is this necessary? I understand that Gram-Schmidt orthogonalizes the (j+1)-th column against the j-th column, but it's followed by a full reorthogonalization via orthog() that does the same thing, so what's the rationale for this extra step?

It helps to also compare to the reference R code, where this reorthogonalization step is optional, see:

https://github.com/bwlewis/irlba/blob/master/R/irlba.R#L601

This is also an optional parameter in Jim and Lothar's original algorithm.

In practice, due to floating point arithmetic, the Krylov basis vectors lose orthogonality, and a single GS vector orthogonalization is not enough. This happens to varying degrees whenever the input matrix is ill-conditioned.

In the R code it's a user-controlled option. Personally, I think it's one too many minor parameters for people to deal with, so in the faster C code I made it mandatory, favoring good results over speed. I recommend just always re-orthogonalizing the basis in practice (users will thank you).
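A quick way to see why a single Gram-Schmidt pass can be insufficient in floating point is the classic "twice is enough" experiment. The toy sketch below (pure Python, all names invented for the example; it is not the irlba code itself) orthogonalizes a vector that is nearly parallel to an existing basis vector: heavy cancellation leaves a non-trivial residual along the basis after one pass, and a second full pass, playing the role orthog() plays in irlb.c, cleans it up.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    n = dot(a, a) ** 0.5
    return [x / n for x in a]

def gs_pass(v, basis):
    # one classical Gram-Schmidt sweep of v against every vector in basis
    for q in basis:
        c = dot(q, v)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    return v

random.seed(1)
n = 200
q1 = normalize([random.gauss(0, 1) for _ in range(n)])
# a vector nearly parallel to q1: one GS pass suffers heavy cancellation
r = [random.gauss(0, 1) for _ in range(n)]
v = [q + 1e-10 * x for q, x in zip(q1, r)]

w1 = normalize(gs_pass(v, [q1]))   # single pass: residual along q1 remains
w2 = normalize(gs_pass(w1, [q1]))  # second ("full") pass removes it
print(abs(dot(w1, q1)), abs(dot(w2, q1)))
```

The first printed number is far above machine precision because the true orthogonal remainder of v is tiny relative to the rounding error; the second is at rounding level, which is why always re-orthogonalizing is the safe default.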

https://github.com/bwlewis/irlba/blob/2ad759c142b1ef43cfdd21065e9ff4db22046256/src/irlb.c#L363

I'm guessing that the various RNORM() calls are intended to just fill W and V with some kind of content when the norm is zero; otherwise we'd end up with some nonsense (indefinite values or all-zero values). I'd further guess that the arbitrariness of the values doesn't really matter here because the associated singular values will be zero after the SVD in the main IRLBA loop.

Almost any vector that is not in the null space of the original matrix will work. Obviously, the best vector(s) to start with are the solution singular vectors! This is how the current restart mechanism works (more on that below). But at its heart, IRLBA is a randomized method, similar to but distinct from the more widely known one by Halko, Martinsson, Tropp, Rokhlin, etc. The difference is that the Rokhlin approach iterates one polynomial over a large amount of data; irlba iterates many polynomials over a small amount of data. See these old paper review notes for more if you're interested:

http://illposed.net/rsvd.html

The nice thing about a random vector is that it's unlikely to be near the null space of the input matrix, which is one of the possible error conditions.
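As a toy illustration of that point (a hedged sketch, not irlba itself; every name here is invented for the example): start a power iteration on A^T A from a random vector. Because a random vector almost surely has a nonzero component along the leading singular direction, the iteration recovers that direction, here for a rank-1 matrix stored via its factors.

```python
import random

random.seed(0)
n = 50
# rank-1 matrix A = 3 * u v^T, stored via its factors (toy stand-in for the input)
u = [random.gauss(0, 1) for _ in range(n)]
v = [random.gauss(0, 1) for _ in range(n)]
nu = sum(x * x for x in u) ** 0.5
u = [x / nu for x in u]
nv = sum(x * x for x in v) ** 0.5
v = [x / nv for x in v]

def matvec(x):
    # A x = 3 * u * (v . x)
    c = 3.0 * sum(vi * xi for vi, xi in zip(v, x))
    return [c * ui for ui in u]

def rmatvec(y):
    # A^T y = 3 * v * (u . y)
    c = 3.0 * sum(ui * yi for ui, yi in zip(u, y))
    return [c * vi for vi in v]

x = [random.gauss(0, 1) for _ in range(n)]  # random start: a.s. not in null(A)
for _ in range(5):                          # power iteration on A^T A
    x = rmatvec(matvec(x))
    nx = sum(t * t for t in x) ** 0.5
    x = [t / nx for t in x]

align = abs(sum(a * b for a, b in zip(x, v)))
print(align)  # near 1: the leading right singular vector is recovered
```

A start vector chosen inside the null space would instead produce a zero (or garbage) iterate, which is exactly the error condition the random fill avoids.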

I probably have a few other questions that will come to me as I stare at the code and paper for a while longer. Of course, any of your thoughts on the current state of the port are also appreciated.

I will look at the code carefully.

I still have not refactored the package code, sorry. My plan was to use your approach to replace all the low-level C code. Separately, I will put up a completely standalone C library version that I have somewhere (similar to your C++ library). I have been completely side-tracked for like a year by some health problems, though.

Best,

Bryan

p.s. there are some differences here and there that improve on Jim's original work based on years of examples and experience. One thing that is not an improvement is the restart scheme in the current package, which Jim has complained to me is not as good as it should be. That restart stuff remains to be re-implemented someday, so be aware of that.

bwlewis avatar Jun 30 '21 11:06 bwlewis

p.s. would you consider being also a co-maintainer of the original irlba package since I am so negligent in updates sometimes? You're into this code quite a bit and have made many valuable contributions already...


bwlewis avatar Jun 30 '21 11:06 bwlewis

answers back to front...

p.s. would you consider being also a co-maintainer of the original irlba package since I am so negligent in updates sometimes? You're into this code quite a bit and have made many valuable contributions already...

I'm flattered but I don't think I have enough understanding of the theory to (co-)maintain it. I already knew my linear algebra was weak, but gee, attempting to read the paper really drove it home. I suspect that there is a decade of study between where I am now and the point at which I could confidently make improvements/fixes to the code. (For example, there were many lines in irlb.c that I thought, "this can't possibly be right, I'll complain to Bryan about this"... but it was, so I guess that's me told.)

That restart stuff remains to be re-implemented someday, so be aware of that.

Sure. I recently threw in some GHA tests to compare my implementation against the R package, which should allow me to keep track of changes on your end. (Assuming my test scenarios are comprehensive enough to trigger any changes in the restarts.)

I have been completely side-tracked for like a year by some health problems though.

Ah. Hope all is well.

Separately, I will put up a completely standalone C library version that I have somewhere (similar to your C++ library).

I wonder if there is a way for us to combine our implementations so we can share the maintenance burden for the core logic. The code should compile in both C and C++; the problem is more about the dependencies during build. I'm eventually compiling to a system without BLAS or LAPACK, hence the use of Eigen (which makes compilations v-e-r-y slow)...

I'd have to think about that for a bit.

See these old paper review notes for more if you're interested:

I'm afraid that's several levels too deep for me, though I do appreciate the description of the distinction from rsvd(), having worked on C++ ports of both algorithms.

On a related note, if you were to ever write an "IRLBA for dummies", I would definitely read it.

I recommend just always re-orthogonalizing the basis in practice

Will do. So if I'm always doing the full reorthogonalization... is there a need for the Gram-Schmidt step at all? I did some empirical testing where its removal didn't change the result IIRC, but I wasn't sure whether there was a deeper purpose to that step.

LTLA avatar Jul 01 '21 08:07 LTLA

Sure, no problem about the package; there are some subtle ideas there. I like the idea of a write-up of a basic intro to projected/partial SVDs (IRLBA for dummies). I'll think about that. Might be something Jim Baglama or Benjamin Erichson would be interested in collaborating on.

Speaking of rsvd, I looked at the Python randomized_svd method again (https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.randomized_svd.html) and it still has no sense of convergence: just run a fixed number of iterations and take what you get. I find that approach pretty dumb. I think once a reasonable block Lanczos method is implemented, there would be little reason to use the Rokhlin method in practice. I think that's basically what Matlab is doing now.
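For contrast, here is a hedged sketch of what even a minimal convergence criterion can look like: stop a toy power iteration when successive singular-value estimates stabilize, rather than after a fixed iteration count. This is illustrative only; it is not the randomized_svd or irlba convergence machinery, and all names and tolerances are invented.

```python
import random

random.seed(2)
n, tol = 40, 1e-10
# toy diagonal matrix: singular values 4, 2, 1, 1, ...
s = [4.0, 2.0] + [1.0] * (n - 2)

def matvec(x):
    # A x for diagonal A (stand-in for a real operator)
    return [si * xi for si, xi in zip(s, x)]

x = [random.gauss(0, 1) for _ in range(n)]
nx = sum(t * t for t in x) ** 0.5
x = [t / nx for t in x]

prev, iters = 0.0, 0
while True:
    y = matvec(matvec(x))            # one step of power iteration on A^T A
    ny = sum(t * t for t in y) ** 0.5
    x = [t / ny for t in y]
    est = ny ** 0.5                  # running estimate of the top singular value
    iters += 1
    if abs(est - prev) < tol * est:  # stop when the estimate stabilizes
        break
    prev = est

print(est, iters)  # est is close to 4.0, in far fewer iterations than a fixed budget
```

The point is only that a relative-change test on the estimated singular values costs almost nothing per iteration and removes the guesswork of picking an iteration count up front.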

You're right about the extra single Gram-Schmidt step; it is superfluous. It came from my reference R implementation, where it is needed because reorthogonalization is optional. When I decided to make reorthogonalization mandatory in the C code, I forgot to remove it!

I have an old C library version I wrote for another project that I'm cleaning up and will put somewhere for posterity. Maybe it (and I) can help with your new implementation, but first I'll just get the code cleaned up a bit and then we'll see.

Best,

Bryan


bwlewis avatar Jul 04 '21 15:07 bwlewis