Natalia Gimelshein
>this fix might cause the grad output stride to no longer match the output's stride in common cases

Out of curiosity, what would be a common case where gradOutput...
Please fix the PR description to reflect the actual PR change. If people complain about extra memory use, we'd need to expose control over which mode gets cached, but for...
@janeyx99 are you ok with landing this?
Dynamo/inductor already turn off autocast caching
Thank you! Is there a link to the fix?
I thought the issue originates with https://github.com/NVIDIA/cutlass/blob/8cd5bef43a2b0d3f9846b026c271593c6e4a8e8a/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py#L256, which inserts only one cudaSetDevice and doesn't restore the device?
So it's not internal LLVM/MLIR driver code; it should go to the cutlass repo.
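For reference, a minimal sketch of the save/restore pattern being asked for, written against torch's Python device APIs rather than the raw CUDA runtime (the `device_guard` name is illustrative, not tvm-ffi's or cutlass's actual code):

```
import torch

class device_guard:
    """Illustrative guard: save the caller's current device, switch to a
    new one, and restore the original on exit. The restore step is what
    a single unpaired cudaSetDevice is missing."""

    def __init__(self, new_device: int):
        self.new_device = new_device

    def __enter__(self):
        self.prev = torch.cuda.current_device()  # save the caller's device
        torch.cuda.set_device(self.new_device)   # analogous to cudaSetDevice
        return self

    def __exit__(self, *exc):
        torch.cuda.set_device(self.prev)         # restore, unlike the generated code
```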
One super annoying thing about cudaSetDevice is that it initializes a context, and pytorch goes through a lot of pain to prevent it. So e.g.
```
a=torch.randn(4, device="cuda:1")
b=torch.randn(4, device="cuda:1")
a+b...
```
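Presumably the point of the example is that work on cuda:1 should leave cuda:0 without a context. One way to check this is with `torch._C._cuda_hasPrimaryContext`, an internal PyTorch helper (used in its own tests, but not stable API), so treat this as a sketch:

```
import torch

a = torch.randn(4, device="cuda:1")
b = torch.randn(4, device="cuda:1")
a + b

# Internal helper: reports whether a CUDA primary context exists on a device.
print(torch._C._cuda_hasPrimaryContext(0))  # expected False: cuda:0 was never touched
print(torch._C._cuda_hasPrimaryContext(1))  # expected True: the work ran on cuda:1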
Since pytorch handles this situation by calling `cudaSetDevice`, I think it would be good for tvm-ffi to do the same. The only minor issue, as I said, is avoiding initializing...
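On that last point: one way to avoid the accidental initialization is to ask the driver whether a primary context is already active before touching the device, which is similar to what pytorch's own `hasPrimaryContext` check does via the driver API. A rough sketch using the `cuda-python` bindings (an assumption; tvm-ffi would do this in C++):

```
from cuda import cuda

(err,) = cuda.cuInit(0)           # loads the driver; does not create a context
err, dev = cuda.cuDeviceGet(0)

# cuDevicePrimaryCtxGetState peeks at the primary context without creating it,
# unlike cudaSetDevice + a runtime call, which would initialize one.
err, flags, active = cuda.cuDevicePrimaryCtxGetState(dev)
print("primary context active on device 0:", bool(active))
```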