
[WIP] [NDTensors] Add `AMDGPU.jl` (ROCm) based extension for NDTensors

Open · wbernoudy opened this pull request 1 year ago • 19 comments

Description

This PR adds an AMDGPU.jl extension for NDTensors following the same structure as the CUDA extension.

It is still a WIP, but I am hoping to get feedback on a few points before finishing it up.

  1. Dot product on ITensors sometimes fails or falls back on slow default methods. If I implement a simple method that permutes B (making a copy) when its indices differ from A's, I can then just call dot on the underlying ROCArrays, like so:
using ITensors
using LinearAlgebra
using NDTensors

function LinearAlgebra.dot(A::ITensor, B::ITensor)
  pB = B
  if inds(A) != inds(B)
    # Permute (and copy) B so its index order matches A.
    pB = permute(B, inds(A))
  end
  # dot on the flat storage dispatches to the GPU vector dot instead of a contraction.
  return dot(NDTensors.data(tensor(A)), NDTensors.data(tensor(pB)))
end

I'm having trouble understanding which low-level functions (presumably in LinearAlgebra) ITensors.dot actually calls. Tracing the calls in the Julia REPL brings me to NDTensors.contract, but I can't figure out what it calls past that.

For example, the @edit call below just brings me to a file in SimpleTraits.jl:

julia> i, j = Index(10, "i"), Index(5, "j")
((dim=10|id=395|"i"), (dim=5|id=574|"j"))

julia> A = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=10|id=395|"i") (dim=5|id=574|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> B = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=10|id=395|"i") (dim=5|id=574|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> labelsA, labelsB = ITensors.compute_contraction_labels(inds(A), inds(B))
((-1, -2), (-1, -2))

julia> @edit NDTensors.contract(A.tensor, labelsA, B.tensor, labelsB)

I expect it's very similar to an issue I was having with other slow tensor contractions with ROCArray-backed ITensors, which I fixed by exposing the right parameters for LinearAlgebra.mul! in AMDGPU.jl (https://github.com/JuliaGPU/AMDGPU.jl/pull/585).

  2. Though this PR does implement the rocSOLVER Jacobi method for SVD, I found it to be significantly slower than divide-and-conquer on the CPU for the TDVP simulations I'm doing. To make the CPU SVD usable I added a roc_to_cpu_svd option for svd_alg which simply copies the matrix to host, calls svd, and then converts U, S, and V back to ROCArrays. However, I'm guessing there may be a cleaner or more general way to do this (maybe one that would also let the other NDTensors extensions, like the CUDA extension, easily use the CPU for SVD).

  3. When calling the rocSOLVER SVD algorithm, the function may try to allocate extra memory on the GPU (I can't find an explanation for this in AMD's rocSOLVER documentation), so if the memory pool has already claimed the entire GPU's memory (i.e. AMDGPU.Mem.free() returns 0) the call may fail. This is why I trim the pool if there is less than 1GB free before calling gesvdj! (a rough sketch of this guard is shown after this list). I will try to investigate this more to see if there is a proper way to handle it.

  4. There is a small bug in the rocSOLVER SVD algorithms in AMDGPU.jl which has been fixed but not released yet (https://github.com/JuliaGPU/AMDGPU.jl/pull/588).
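
As a rough illustration of the guard described in point 3 (a sketch only; the exact pool-trimming call isn't reproduced here):

using AMDGPU

# Only relevant when the pool has grown to (nearly) the full device memory, since
# gesvdj! appears to allocate extra device memory of its own.
if AMDGPU.Mem.free() < 2^30  # less than ~1 GiB free on the device
  GC.gc()  # drop unreferenced ROCArrays so the pool can shrink
  # ... followed by the AMDGPU.jl pool-trimming call (omitted here)
end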

How Has This Been Tested?

I added the new NDTensors.roc function to the devices list in the NDTensors tests. However, there are several tests failing due to:

  1. The aforementioned dot issue
  2. Tests that call svd directly on ROCArrays, which seems to return junk: https://github.com/ITensor/ITensors.jl/blob/98b95a20f7b3c31324a2f23c730af4fc173b61f3/NDTensors/src/lib/Unwrap/test/runtests.jl#L94 I believe this should be fixed in AMDGPU.jl.
  3. There are many tests that call ql on ROCArrays, which is not handled by AMDGPU.jl (similar to CUDA.jl). There seems to be an exception made for CUDA arrays in various QL decomposition functions, e.g. https://github.com/ITensor/ITensors.jl/blob/98b95a20f7b3c31324a2f23c730af4fc173b61f3/NDTensors/src/linearalgebra/linearalgebra.jl#L392 I'm not sure if I should add similar code for ROCArrays here (a rough sketch of what that might look like is shown below).
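
For reference, mirroring that CUDA special case might look roughly like this (a hypothetical sketch, not code from this PR; it assumes NDTensors.ql is the CPU implementation, and ql_roc_fallback is a made-up name):

using NDTensors
using AMDGPU: ROCArray, ROCMatrix

# Hypothetical fallback: compute the QL decomposition of a host copy using the
# CPU code path, then move the factors back to the device.
function ql_roc_fallback(A::ROCMatrix)
  Q, L = NDTensors.ql(Array(A))
  return ROCArray(Q), ROCArray(L)
end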

I also added an example script similar to the CUDA one.

Checklist:

  • [x] My code follows the style guidelines of this project. Please run using JuliaFormatter; format(".") in the base directory of the repository (~/.julia/dev/ITensors) to format your code according to our style guidelines.
  • [x] I have performed a self-review of my own code.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [x] I have added tests that verify the behavior of the changes I made.
  • [ ] I have made corresponding changes to the documentation.
  • [ ] My changes generate no new warnings.
  • [ ] Any dependent changes have been merged and published in downstream modules.

wbernoudy · Feb 02 '24 18:02

At first glance this looks good, thanks! This is a nice payoff for a lot of work @kmp5VT and I have been doing to make the overload requirements for new GPU backends as minimal as possible, I'm glad to see this doesn't seem to require very much code to get working.

The SVD code definitely looks like it is the most complicated part that we will have to look at carefully, but besides that it all looks similar to what we needed to overload for the CUDA and Metal backends.

~In terms of dot, it just calls tensor contraction: https://github.com/ITensor/ITensors.jl/blob/v0.3.55/src/itensor.jl#L1905 so maybe you are hitting a corner case of tensor contraction that is going wrong for the AMDGPU backend? You would have to share a minimal example for us to have more information.~ Sorry, I see you shared an example and started investigating the call stack. I can't remember if we have a special code path for inner product-like contractions, we would have to look into that. Something you could do is use a profiler to try to investigate where the slow code is, or maybe disable scalar indexing and see if it errors somewhere.
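
For the scalar indexing check, something like the following should surface any scalar-indexing fallback (a sketch, assuming AMDGPU.allowscalar mirrors CUDA.allowscalar):

using AMDGPU

# Disallow scalar getindex/setindex! on ROCArrays so any generic fallback loop
# errors immediately instead of silently iterating element by element.
AMDGPU.allowscalar(false)

# Then run the slow operation, e.g. dot(A, B) or dag(A) * B, and see if it throws.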

mtfishman · Feb 02 '24 18:02

Looks like there is some package compatibility issue in the tests.

mtfishman · Feb 02 '24 19:02

Absolutely, this was far easier to get (mostly) working than I expected it to be, thanks to your and @kmp5VT 's great work!

And yes, there seem to be more parameters to the gesvdj functions available in rocSOLVER than in cuSOLVER. We can definitely reconsider the defaults I have set there or perhaps allow a way to pass them in from the DMRG algorithms.

I will try disabling scalar indexing to work on the dot issue, I didn't try that yet.

wbernoudy · Feb 02 '24 19:02

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 53.78%. Comparing base (f4ad958) to head (c03a9bc).

:exclamation: Current head c03a9bc differs from pull request most recent head b0356a2. Consider uploading reports for the commit b0356a2 to get more accurate results


Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1325       +/-   ##
===========================================
- Coverage   84.40%   53.78%   -30.62%     
===========================================
  Files         100       99        -1     
  Lines        8581     8528       -53     
===========================================
- Hits         7243     4587     -2656     
- Misses       1338     3941     +2603     


codecov-commenter · Feb 02 '24 19:02

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

I think it is a subtle question to figure out how to select different SVD backends for different devices (say you want to specify that if you run on CUDA you want to use a certain cuLAPACK implementation, but then on AMD you want to transfer to CPU to perform SVD, then on Metal you want to...). That will be a longer discussion and may require us to think about the higher level SVD interface.

It may be that we should add support in the svd function for the alg keyword argument to be, in general, a function that takes in information about the backend and returns a string or Algorithm object depending on that backend, i.e.:

svd_alg(::Type{<:CuArray}) = "cpu"
svd_alg(::Type{<:ROCArray}) = "jacobi_algorithm"
svd_alg(::Type{<:MtlArray}) = "cpu"

u, s, v = svd(t, (i, j); alg=svd_alg)
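
Internally, resolving the alg keyword could then look something like this (a rough, hypothetical sketch; resolve_svd_alg is a made-up name):

# Hypothetical helper: alg may be a string/Algorithm, or a function of the storage type.
resolve_svd_alg(alg, storagetype::Type) = alg isa Function ? alg(storagetype) : alg

# With the definitions above, resolve_svd_alg(svd_alg, typeof(NDTensors.data(t))) would
# return "cpu" for a CuArray-backed tensor and "jacobi_algorithm" for a ROCArray-backed one.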

We would also like to add automated testing for AMD GPUs, @kmp5VT will start to look into that.

mtfishman · Feb 02 '24 19:02

@wbernoudy I am currently working on finding a way to access an AMD GPU device. Would it be possible to provide me access to your branch so I can help accelerate your development? I am not that familiar with the AMDGPU.jl package, but it does look like there are rocblas and rocsolver directories which link the Julia code to the C libraries in ROCm. We should be able to access these functions through linearalgebra.jl; if we can't, that's something we could bring up with AMDGPU.jl and it should be a relatively easy implementation. Regarding the dot function, I believe we are using LinearAlgebra.mul! to launch a BLAS dot function.

kmp5VT · Feb 02 '24 20:02

Would it be possible to provide me access to your branch so I can help accelerate your development?

Of course, I added you as a collaborator on my fork.

To be clear, this extension is already using the accelerated BLAS functions provided by AMDGPU.jl for tensor operations, and it is using the SVD provided in rocSOLVER (and even eigen methods).

AMDGPU.jl now provides both the three- and five-argument versions of LinearAlgebra.mul!. But I'm guessing there is a certain transformation on one or more of the matrices used somewhere in the call stack of dot that is not yet handled by AMDGPU.jl. I found another of these cases, which I fixed here: https://github.com/wbernoudy/ITensors.jl/blob/01f6c71844114eb92a9a50c4b23343161b9b028f/NDTensors/ext/NDTensorsROCmExt/mul.jl#L27

wbernoudy · Feb 02 '24 23:02

(I believe I figured this out; you can skip to the next comment.)

@mtfishman I tried the idea of disabling scalar indexing to see if some generic matmul function is using it in the dot code path, but no luck. I disabled the indexing simply by commenting out the relevant functions in NDTensors/ext/NDTensorsAMDGPUExt/indexing.jl, which I believe was the right way to test this (and I do see it reach an ERROR: Scalar indexing is disallowed. on some calls). Is there another way it could be allowing scalar indexing that I might be missing?

Here you can see the standard dot function does not hit any scalar indexing (other than the very final read of the zero-dimensional result tensor) but is ~500x slower than calling dot on the underlying arrays.

julia> using AMDGPU

julia> using ITensors

julia> using NDTensors  # modified from current branch to disallow scalar indexing

julia> using LinearAlgebra

julia> i, j = Index(6152, "i"), Index(1324, "j")
((dim=6152|id=857|"i"), (dim=1324|id=611|"j"))

julia> A = ITensor(randomTensor(ROCArray, (i, j)))
ITensor ord=2 (dim=6152|id=857|"i") (dim=1324|id=611|"j")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> B = ITensor(randomTensor(ROCArray, (j, i)))
ITensor ord=2 (dim=1324|id=611|"j") (dim=6152|id=857|"i")
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> dot(A, B)
ERROR: Scalar indexing is disallowed.
(rest of stack trace)

julia> dag(A) * B
ITensor ord=0
Dense{Float64, ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}}

julia> AMDGPU.@allowscalar (dag(A) * B)[]
-3066.643034152186

julia> AMDGPU.@elapsed dag(A) * B
0.29850277f0

julia> function LinearAlgebra.dot(A::ITensor, B::ITensor)
         pB = B
         if (inds(A) != inds(B))
           pB = permute(B, inds(A))
         end
         return dot(data(tensor(A)), data(tensor(pB)))
       end

julia> dot(A, B)
-3066.643034153356

julia> AMDGPU.@elapsed dot(A, B)
0.000623521f0

wbernoudy · Feb 09 '24 23:02

Actually, I think I have figured out the slowdown with dot (so my previous comment can be ignored). The issue is that it ends up calling the AMDGPU gemm method, which is itself quite slow when used to compute a dot product on large vectors. Perhaps other GPU libraries (e.g. CUDA) handle that case better, but I'm guessing the rocblas_{T}gemm functions are doing the naive thing and having threads write to the same output value atomically.

Calling dot directly on the arrays uses the method defined in the GPUArrays library, which uses mapreduce and seems to be far faster. This makes me think there should be a definition of dot in ITensors.jl that calls dot on the underlying arrays instead of implicitly calling a general matmul by contracting as normal. I'd be happy to make a separate PR for this.

wbernoudy · Feb 10 '24 00:02

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

This is exactly what I added the "roc_to_cpu_svd" alg type for. I would be happy to make this the default in this PR (instead of "jacobi_algorithm").

@mtfishman Would it be better if I removed the Jacobi implementation from this PR?

wbernoudy · Feb 10 '24 00:02

One suggestion I would have to make this first PR simpler to review would be to just support an SVD implementation that transfers to CPU, and then add in other SVD backends in a future PR.

This is exactly what I added the "roc_to_cpu_svd" alg type for. I would be happy to make this the default in this PR (instead of "jacobi_algorithm").

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API. I think we will have to think more carefully about how to allow users to control device transfers in low level operations more generally, perhaps with an API that is orthogonal to the alg keyword argument.

@mtfishman Would it be better if I removed the Jacobi implementation from this PR?

Yes, I think so. The code you wrote seems more appropriate to add as an svd function to AMDGPU.jl, and given that it doesn't appear to help with performance it doesn't seem very compelling to add it here and have to support it. It would also simplify the code a lot and let us push off a decision about what to name the alg arguments.

mtfishman · Feb 12 '24 14:02

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API.

Very much agreed. An orthogonal argument that allows you to choose the device (in addition to the SVD algorithm) sounds great to me.
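
Something like the following, purely as a hypothetical illustration (neither the device keyword nor the device specifier exists yet):

# Hypothetical future API: choose the SVD algorithm and the device it runs on
# independently, e.g. run LAPACK divide-and-conquer on the host for a GPU-backed tensor.
U, S, V = svd(t, (i, j); alg="divide_and_conquer", device=NDTensors.cpu)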

Either way, I'm happy to either wait until after a more precise API has been figured out to finalize this PR, or keep the awkward roc_to_cpu_svd for now. Or perhaps there is a cleaner way to do this now without the API change? My understanding is that we need something for the name of the SVD alg type for ROCMatrix, as it must be used here: https://github.com/ITensor/ITensors.jl/blob/2fd3e0a2658831903317b753934d04f37fcdcb46/NDTensors/src/linearalgebra/linearalgebra.jl#L86

The code you wrote seems more appropriate to add as an svd function to AMDGPU.jl

I figured this might be the case, and it certainly makes sense to handle the out-of-memory case properly in AMDGPU.jl instead. I will take this out :+1:

wbernoudy · Feb 12 '24 18:02

Yes, I understand that. I just don't like that interface. How would you choose the CPU SVD backend? Would we need to define a new "*_to_cpu_svd" for every GPU backend? That doesn't seem like a sustainable forward-looking API.

Very much agreed. An orthogonal argument that allows you to choose the device (in addition to the SVD algorithm) sounds great to me.

Either way, I'm happy to either wait until after a more precise API has been figured out to finalize this PR, or keep the awkward roc_to_cpu_svd for now. Or perhaps there is a cleaner way to do this now without the API change? My understanding is that we need something for the name of the SVD alg type for ROCMatrix, as it must be used here:

Glad we're in agreement about that. I think since there will be only one SVD backend for ROCMatrix with the current plan (one where we transfer to CPU), we wouldn't need to involve the alg keyword argument in this PR; see, for example, the factorizations defined for Metal: https://github.com/ITensor/ITensors.jl/blob/main/NDTensors/ext/NDTensorsMetalExt/linearalgebra.jl. So we could still finish this PR in a simpler form and then add back other SVD backends later as they become available, with a better API for choosing between those backends on GPU or transferring to CPU.

mtfishman · Feb 12 '24 18:02

Sorry, I didn't see you referring to default_svd_alg(T). I forget how that is handled for Metal but it could just be handled the same way, presumably it chooses the default that would be used for CPU SVD and then forwards that choice to the CPU SVD backend.

mtfishman · Feb 12 '24 18:02

Oh, I didn't think about just overloading LinearAlgebra.svd directly! That's much more straightforward for this. I will make that change.
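
Roughly along these lines, I assume (a simplified sketch only; the actual extension would dispatch through NDTensors' wrapper types rather than overloading directly on ROCMatrix, to avoid clashing with AMDGPU.jl's own svd):

using LinearAlgebra
using AMDGPU: ROCArray, ROCMatrix

# Simplified sketch of a CPU round-trip SVD: copy to host, run the LAPACK SVD,
# and move the factors back to the device.
function LinearAlgebra.svd(A::ROCMatrix; kwargs...)
  F = svd(Array(A); kwargs...)
  return SVD(ROCArray(F.U), ROCArray(F.S), ROCArray(F.Vt))
end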

wbernoudy · Feb 12 '24 19:02

@wbernoudy Hi, just giving you an update. I now have access to an AMD GPU and have made a few changes (nothing significant). One thing is that the changes I am making in another PR also affect the work here. What I am doing is working to finish that other PR, pull the changes in here, and make those modifications here as well so that we don't have to write this code twice. I am hoping to finish 1331 this week and put more time into your PR next week.

kmp5VT · Feb 20 '24 16:02

As far as I can tell, the RUN NDTensors tests CI check is failing because the latest version of AMDGPU.jl that supports Julia 1.6 is 0.2.17.

─AMDGPU [21141c5a] log: ... ├─restricted by julia compatibility requirements to versions: 0.1.0-0.2.17 or uninstalled, leaving only versions: 0.1.0-0.2.17

We will need a recent version of AMDGPU.jl like 0.8.11, which seems to require Julia 1.9. I'm guessing upgrading the CI check to Julia 1.9 may not be easy/feasible, but I'm not sure how else we can handle this.

wbernoudy · Mar 12 '24 17:03

As far as I can tell, the RUN NDTensors tests CI check is failing because the latest version of AMDGPU.jl that supports Julia 1.6 is 0.2.17.

─AMDGPU [21141c5a] log: ... ├─restricted by julia compatibility requirements to versions: 0.1.0-0.2.17 or uninstalled, leaving only versions: 0.1.0-0.2.17

We will need a recent version of AMDGPU.jl like 0.8.11, which seems to require Julia 1.9. I'm guessing upgrading the CI check to Julia 1.9 may not be easy/feasible, but I'm not sure how else we can handle this.

It seems like the community (by which I mean DifferentialEquations.jl) is moving towards only supporting Julia 1.9 and above, so I would be ok moving to that. However, for the sake of this PR, can't we just skip the tests which rely on AMDGPU.jl if the Julia version is below 1.9?
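
For example, the device list in the NDTensors test setup could be gated on the Julia version, roughly like this (a sketch; it assumes NDTensors.cpu and the new NDTensors.roc are the device functions used in the test loop):

using NDTensors

devices = Any[NDTensors.cpu]
# Only load AMDGPU.jl and test the ROCArray backend on Julia >= 1.9, since the
# AMDGPU.jl releases that still support Julia 1.6 are far too old for this extension.
if VERSION >= v"1.9"
  using AMDGPU
  push!(devices, NDTensors.roc)
end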

mtfishman · Mar 12 '24 17:03

@wbernoudy I completely understand the problem you are running into and can definitely help you fix it. I think we are running into the same issue with Metal.jl, and both will need to be addressed in order to set up CI unit testing for these backends. @mtfishman We can definitely skip the AMDGPU tests for now, as this is what we are currently doing with Metal.jl, but I would like to have an overall solution so we can properly unit test both backends. To do this, just put AMDGPU in the [extras] section of NDTensors/test/Project.toml and, when you want to run the AMDGPU tests locally, move it into the [deps] section.

To give you an update, this PR is number 3 on my list of things to do. The first two items are related to the TypeParameterAccessors module; I am currently in the process of updating that system across the ITensors library. To limit the number of rewrites of this PR I want to finish that integration first. I would guess that it will take me about a week to finish the integration, and next week I can get to this PR.

kmp5VT · Mar 12 '24 18:03

@mtfishman If you have time to look through this PR, I have made some comments in the code explaining why I have implemented things the way I have. For the most part it's exactly the same as the other extensions. I found two issues. One is that AMDGPU doesn't have an append! function, which means that BlockSparse dmrg fails. The other is that something goes wrong in qr when singular=true and real(elt) = Float32; I suspect this is in the rocSOLVER library and I have disabled these tests. Otherwise all NDTensors tests pass using AMDGPU.

kmp5VT · Mar 19 '24 21:03

@wbernoudy Does this PR look good to you? Also, do you know why I am seeing the Float32 and ComplexF32 QR tests failing? Thanks!

kmp5VT · Mar 20 '24 15:03

Looks good! Thanks @wbernoudy and @kmp5VT.

mtfishman · Mar 21 '24 20:03