MPSKit.jl icon indicating copy to clipboard operation
MPSKit.jl copied to clipboard

SVD using the divide and conquer algorithm might fail for certain tensors.

Open Gertian opened this issue 1 year ago • 8 comments

Currently the default algorithm for singular value decomposition is a divide and conquer algorithm.

From an optimization perspective this is great. Indeed, for large tensors such algorithms outperform the standard method Householder(?) method but they are known to be less stable and prone to crashing.

One specific stack trace for such a crash in MPSKit might be : image but I suspect there are a multiple a ways to reproduce this crash.

In my particular case the issue was patched by changing projectisometric! from TensorKitManifolds.jl into :

function projectisometric!(W::AbstractTensorMap; alg=Polar())
    if alg isa TensorKit.Polar || alg isa TensorKit.SDD
        foreach(blocks(W)) do (c, b)
            #sometimes the divide and conquer algorithm used in polarsdd fails and gives LAPACK error 1
            #in that case, we try the standard SVD algorithm which is slightly more expensive but more robust
            #in case that fails I'll be a sad camper... (spoiler, it doesn't)
            b_backup = copy(b)
            try 
                return _polarsdd!(b)
            catch
                b = b_backup
                return _polarsvd!(b)
            end
        end
    elseif alg isa TensorKit.SVD
        foreach(blocks(W)) do (c, b)
            return _polarsvd!(b)
        end
    elseif alg isa PolarNewton
        foreach(blocks(W)) do (c, b)
            return _polarnewton!(b)
        end
    else
        throw(ArgumentError("unkown algorithm for projectisometric!: alg = $alg"))
    end
    return W
end

so that the slower, more stable method is used whenever the initial attempt fails.

Clearly this solution is quite (one could even say very) messy and slow (since it has to make a backup of the to-be-svd'd tensor) so that much improvement is still possible. In particular MPSKit should offer the option to choose different SVD decompositions on default. And perhaps there is a smarter way to implement the try cath loop without the necessity of a backup ?

Gertian avatar Jan 23 '24 14:01 Gertian

This has been a real pain in the past, so at some point I decided to switch over to the slower-but-stable method in mpskit (at some point we should probably make this configurable on the algorithm level. You could then even provide a fallback that tries the fast method first, and only relies on the fallback once it crashes). I think the crash you see is not mpskit's fault, but rather TensorKitManifold's! https://github.com/Jutho/TensorKitManifolds.jl/blob/7391fe01a4c60eb06ef8fcf2a3b4849bffdbce1e/src/grassmann.jl#L172

Out of curiousity, what happens if you continue polarsvd! on a failed polarsdd! block? Without the copy?

Op di 23 jan 2024 om 15:38 schreef Gertian @.***>:

Currently the default algorithm for singular value decomposition is a divide and conquer algorithm.

From an optimization perspective this is great. Indeed, for large tensors such algorithms outperform the standard method Householder(?) method but they are known to be less stable and prone to crashing https://discourse.julialang.org/t/lapackexception-1-while-svd-but-not-svdvals/23787 .

One specific stack trace for such a crash in MPSKit might be : image.png (view on web) https://github.com/maartenvd/MPSKit.jl/assets/11559970/f465191f-b4d7-420d-8cf7-2185a37a836c but I suspect there are a multiple a ways to reproduce this crash.

In my particular case the issue was patched by changing projectisometric! into :

function projectisometric!(W::AbstractTensorMap; alg=Polar()) if alg isa TensorKit.Polar || alg isa TensorKit.SDD foreach(blocks(W)) do (c, b) #sometimes the divide and conquer algorithm used in polarsdd fails and gives LAPACK error 1 #in that case, we try the standard SVD algorithm which is slightly more expensive but more robust #in case that fails I'll be a sad camper... (spoiler, it doesn't) b_backup = copy(b) try return _polarsdd!(b) catch b = b_backup return _polarsvd!(b) end end elseif alg isa TensorKit.SVD foreach(blocks(W)) do (c, b) return _polarsvd!(b) end elseif alg isa PolarNewton foreach(blocks(W)) do (c, b) return _polarnewton!(b) end else throw(ArgumentError("unkown algorithm for projectisometric!: alg = $alg")) end return Wend

so that the slower, more stable method is used whenever the initial attempt fails.

Clearly this solution is quite (one could even say very) messy and slow (since it has to make a backup of the to-be-svd'd tensor) so that much improvement is still possible.

— Reply to this email directly, view it on GitHub https://github.com/maartenvd/MPSKit.jl/issues/109, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJKVCW5PUTMVPUAB7VGYQ3YP7DM3AVCNFSM6AAAAABCHDN2BCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TMMRTGIZTEOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

maartenvd avatar Jan 23 '24 14:01 maartenvd

I suggested Gertian to post this here rather than in TensorKitManifolds, as I assumed that it would also affect other algorithms. I did not know that you opted for GESVD globally throughout MPSKit.

Anyway, I think we indeed want to be able to configure this, which requires changes both here and in TensorKitManifolds.jl, cause there I started supporting both with the _polar... methods, but then this is not continued through the higher level retract functions.

Jutho avatar Jan 23 '24 14:01 Jutho

The problem with the fallback strategy is that GESDD might already ruin your matrix before crashing, so you need to make a copy of the original matrix just in case.

Jutho avatar Jan 23 '24 14:01 Jutho

I hoped gesdd simply failed to converge, and that the data would be salvageable. If it isn't, then the overhead of a copy combined with the slow garbage collector might make trying out the fast method first not worth it...

I agree that we want to make it configurable. I stopped using polar decompositions because of this instability, so the only places where mpskit still does svd's should be in two-site schemes and in changebonds, in both places the svd algorithm can become a field in the algorithm struct. If tensorkitmanifolds provides some way of specifying the polar/svd algorithm, then fixing this crash will be a small change to the grassmann code.

Has there been any progress on optimkit's linesearch? We might combine the update with rewriting the grassmann code again :)

Op di 23 jan 2024 om 15:58 schreef Jutho @.***>:

The problem with the fallback strategy is that GESDD might already ruin your matrix before crashing, so you need to make a copy of the original matrix just in case.

— Reply to this email directly, view it on GitHub https://github.com/maartenvd/MPSKit.jl/issues/109#issuecomment-1906230744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJKVCRU4S5VNHSIYPRWVU3YP7F27AVCNFSM6AAAAABCHDN2BCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBWGIZTANZUGQ . You are receiving this because you commented.Message ID: <maartenvd/MPSKit .@.***>

maartenvd avatar Jan 23 '24 15:01 maartenvd

I quickly tested the amount of overhead and it seems to be the case that my first bug-fix is quite a bit slower than just defaulting to the stable version.

personally I think it's more logical to have the stable version as a default since end-users that are less familiar with numerics might be scared away by a random crash like the one I experienced.

Gertian avatar Jan 24 '24 12:01 Gertian

I think I agree with setting the default values as stable as possible, and afterwards trying to be able to incorporate the option of having the faster version. The time gained in using a faster algorithm is easily lost by having to deal with crashes, especially if you have no idea why or what is going on.

lkdvos avatar Jan 25 '24 12:01 lkdvos

so plan of action - we first change tensorkitmanifolds to allow passing in the svd algorithm, and then change mpskit to pick the stable one?

Op do 25 jan 2024 om 13:39 schreef Lukas @.***>:

I think I agree with setting the default values as stable as possible, and afterwards trying to be able to incorporate the option of having the faster version. The time gained in using a faster algorithm is easily lost by having to deal with crashes, especially if you have no idea why or what is going on.

— Reply to this email directly, view it on GitHub https://github.com/maartenvd/MPSKit.jl/issues/109#issuecomment-1910121383, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJKVCVFHT3QXH4GTL75AYTYQJHA3AVCNFSM6AAAAABCHDN2BCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJQGEZDCMZYGM . You are receiving this because you commented.Message ID: <maartenvd/MPSKit .@.***>

maartenvd avatar Jan 25 '24 14:01 maartenvd

So the retract methods of TensorKitManifolds do actually have an alg keyword, but it is unused for Grassmann and Unitary. This was supposed to be used for different retraction schemes (it is used for Stiefel).

What I am confused by though, is that I thought that the retraction schemes were actually creating a new isometric matrix automatically, and that the projectisometric which used the polar decomposition was just there to get rid of rounding errors. So this means that the matrix would already be close to isometric, and therefore be very well conditioned.

Of course, this also means that its singular values would all be very close to one, and maybe this is what is giving the DivideAndConquer scheme a hard time.

Jutho avatar Jan 26 '24 15:01 Jutho

We should just put U, S, V, = tsvd(Δ.Z, alg = SVD()) in https://github.com/Jutho/TensorKitManifolds.jl/blob/7391fe01a4c60eb06ef8fcf2a3b4849bffdbce1e/src/grassmann.jl#L59 to fix this issue. (At least in my particular case)

Gertian avatar Aug 02 '24 08:08 Gertian

I suggested Gertian to post this here rather than in TensorKitManifolds, as I assumed that it would also affect other algorithms. I did not know that you opted for GESVD globally throughout MPSKit.

Anyway, I think we indeed want to be able to configure this, which requires changes both here and in TensorKitManifolds.jl, cause there I started supporting both with the _polar... methods, but then this is not continued through the higher level retract functions.

Actually it seems like you partially started supporting it... image I'll try to complete what you started here. I'm not familiar with the entire code so I'll just follow the instances of the above stacktrace and make a PR. If there are more places that need change we can then change this PR.

Gertian avatar Aug 12 '24 11:08 Gertian

Ok I think I did the neccecairy changes for this to work in : https://github.com/QuantumKitHub/MPSKit.jl/tree/SVD/SDD_fix and https://github.com/Gertian/TensorKitManifolds.jl

In TensorKitManifolds I simply made it so that alg actually functions. @Jutho had already started this but didn't finish.

In MPSKit, I added an additional field svd_arg to the GradientGrassmann struct which can either be SVD/SDD (I didn't implement this to be the only two options because I'm not sure what the existing machinery for this is) . These arguments are then passed to TensorKitManifolds whenever an SVD is performed.

Gertian avatar Aug 12 '24 13:08 Gertian

So, in conclusion, the new approach will have the default algorithm as SVD (stable/slower), already in TensorKitManifolds, such that from the MPSKit side of things nothing has to happen.

lkdvos avatar Sep 03 '24 06:09 lkdvos