Natalia Gimelshein
>this fix might cause the grad output stride to no longer match the output's stride in common cases

Out of curiosity, what would be a common case where gradOutput...
Please fix the PR description to reflect the actual PR change. If people complain about extra memory use, we'd need to expose control over which mode gets cached, but for...
@janeyx99 are you ok with landing this?
Dynamo/inductor already turn off autocast caching
Thank you! Is there a link to the fix?
I thought the issue originates with https://github.com/NVIDIA/cutlass/blob/8cd5bef43a2b0d3f9846b026c271593c6e4a8e8a/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py#L256, which inserts only one cudaSetDevice and doesn't restore the device?
So it's not internal LLVM/MLIR driver code; it should go to the cutlass repo.
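For reference, a minimal sketch of the save/restore pattern being asked for, written against torch's Python device APIs rather than the raw CUDA runtime (the `device_guard` name is illustrative, not tvm-ffi's or cutlass's actual code):

```
import torch

class device_guard:
    """Illustrative guard: save the caller's current device, switch to a
    new one, and restore the original on exit. The restore step is what
    a single unpaired cudaSetDevice is missing."""

    def __init__(self, new_device: int):
        self.new_device = new_device

    def __enter__(self):
        self.prev = torch.cuda.current_device()  # save the caller's device
        torch.cuda.set_device(self.new_device)   # analogous to cudaSetDevice
        return self

    def __exit__(self, *exc):
        torch.cuda.set_device(self.prev)         # restore, unlike the generated code
```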
One super annoying thing about cudaSetDevice is that it initializes a context, and pytorch goes through a lot of pain to prevent it. So e.g.
```
a=torch.randn(4, device="cuda:1")
b=torch.randn(4, device="cuda:1")
a+b...
```
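Presumably the point of the example is that work on cuda:1 should leave cuda:0 without a context. One way to check this is with `torch._C._cuda_hasPrimaryContext`, an internal PyTorch helper (used in its own tests, but not stable API), so treat this as a sketch:

```
import torch

a = torch.randn(4, device="cuda:1")
b = torch.randn(4, device="cuda:1")
a + b

# Internal helper: reports whether a CUDA primary context exists on a device.
print(torch._C._cuda_hasPrimaryContext(0))  # expected False: cuda:0 was never touched
print(torch._C._cuda_hasPrimaryContext(1))  # expected True: the work ran on cuda:1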
Since pytorch handles this situation by calling `cudaSetDevice`, I think it would be good for tvm-ffi to do the same. The only minor issue, as I said, is avoiding initializing...
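On that last point: one way to avoid the accidental initialization is to ask the driver whether a primary context is already active before touching the device, which is similar to what pytorch's own `hasPrimaryContext` check does via the driver API. A rough sketch using the `cuda-python` bindings (an assumption; tvm-ffi would do this in C++):

```
from cuda import cuda

(err,) = cuda.cuInit(0)           # loads the driver; does not create a context
err, dev = cuda.cuDeviceGet(0)

# cuDevicePrimaryCtxGetState peeks at the primary context without creating it,
# unlike cudaSetDevice + a runtime call, which would initialize one.
err, flags, active = cuda.cuDevicePrimaryCtxGetState(dev)
print("primary context active on device 0:", bool(active))
```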