Peter
Shouldn't be hard to add. Generally it just needs to make sure that a few functions, like [NeuralAttentionlib's gemm](https://github.com/chengchingwen/NeuralAttentionlib.jl/blob/2b9a568280a96ce8c2b00433c3c7dbf8aa7cf9dd/src/matmul/gpu.jl#L11), dispatch to the correct backend implementations. But I don't have an AMD GPU to test with.
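For reference, the overload would roughly mirror the linked `CuArray` method for `ROCArray`. A hedged sketch, untested on real hardware; the `gemm!` name/signature here is an assumption based on the CUDA path, and it assumes `AMDGPU.rocBLAS.gemm_strided_batched!` exists analogously to `CUDA.CUBLAS`:

```julia
using AMDGPU

# Hypothetical ROCArray method mirroring NeuralAttentionlib's CuArray gemm;
# the real NeuralAttentionlib function name/arguments may differ.
function gemm!(transA::Char, transB::Char, alpha, A::ROCArray{T,3}, B::ROCArray{T,3},
               beta, C::ROCArray{T,3}) where T
    # dispatch batched matmul to rocBLAS, like the CUDA path does with CUBLAS
    AMDGPU.rocBLAS.gemm_strided_batched!(transA, transB, alpha, A, B, beta, C)
    return C
end
```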
Could we combine this with https://github.com/JuliaDiff/ChainRulesCore.jl/pull/592 ?
> It's possible. I think that means having two distinct structs, ZygoteRuleConfig and ZygoteOnceRuleConfig or something. Or introduce another type parameter like `ZygoteRuleConfig{once} where once`?

> At present, BTW, most...
> But I don't think that fits CR's mechanism; the current struct is
[nvvm](https://github.com/llvm/llvm-project/blob/ca2f53897a2f2a60d8cb1538d5fcf930d814e9f5/llvm/include/llvm/IR/IntrinsicsNVVM.td#L1292) seems to have intrinsics for asynchronous copy; would these help?
@maleadt Any pointer on where to start?
I was reading the [nvidia doc about async copy](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy) and it seems LLVM only supports a small portion of the instructions? Here is some code translated from `async-copy.ll`:

```jl
...
```
> What pieces are missing?

According to the doc, there are some extra arguments to `cp.async.ca.shared.global`, like `ignore-src`.

> the idea is indeed to make something high-level like async_copyto! and have...
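To make the translation concrete, a minimal sketch of calling one of those NVVM intrinsics from Julia via `llvmcall`, in the style of LLVM's `async-copy.ll` test. This is untested and the address-space constants are assumptions (3 = shared, 1 = global, per the NVPTX convention); the intrinsic name is taken from IntrinsicsNVVM.td:

```julia
using Core: LLVMPtr  # typed pointer with an address-space parameter

# Hedged sketch: issue a 4-byte cp.async.ca.shared.global via the NVVM
# intrinsic. dst must point into shared memory, src into global memory.
@inline cp_async_4!(dst::LLVMPtr{UInt8,3}, src::LLVMPtr{UInt8,1}) =
    ccall("llvm.nvvm.cp.async.ca.shared.global.4", llvmcall, Cvoid,
          (LLVMPtr{UInt8,3}, LLVMPtr{UInt8,1}), dst, src)
```

A high-level `async_copyto!` would then wrap calls like this plus the commit/wait-group intrinsics, but the extra PTX operands such as `ignore-src` don't appear to be exposed this way.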
Does this work for CuArray? I mean, IIUC, NNlibCUDA is using the CUDNN softmax.
> I thought NNlib stopped using the NNlibCUDA one, as it was slower.

According to Cthulhu, nope:

```julia
julia> @descend softmax(cu(randn(3,3)); dims=1)
(::NNlib.var"#softmax##kw")(::Any, ::typeof(softmax), x::T) where T
%20)::CUDA.CUDNN.cudnnSoftmaxAlgorithm_t
│ │...
```
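A lighter-weight way to check the same thing, without stepping through Cthulhu, is to ask which method dispatch actually selects. A hedged sketch, assuming NNlibCUDA is loaded so its CUDNN-backed overloads are registered:

```julia
using CUDA, NNlib, NNlibCUDA  # NNlibCUDA registers the CUDNN-backed methods

x = cu(randn(Float32, 3, 3))
# Prints the winning method and its source location, revealing whether the
# generic NNlib kernel or the NNlibCUDA/CUDNN overload is being hit.
@which softmax(x; dims=1)
```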