xformers
ScaledDotProduct cannot run on cuda:1
🐛 Bug
When I run on cuda:0, everything works fine. But when I run on cuda:1, the following error occurs:
Triton softmax kernel register spillover or invalid image caught.Deactivating this kernel, please file an issue int the xFormers repository
Triton Error [CUDA]: context is destroyed
Command
To Reproduce
import torch
from xformers.components.attention import ScaledDotProduct

# Put the module and all inputs on the second GPU (cuda:1)
model = ScaledDotProduct().cuda(1)
q = torch.randn(16, 16, 64).cuda(1).requires_grad_()
k = torch.randn(16, 8, 64).cuda(1).requires_grad_()
v = torch.randn(16, 8, 64).cuda(1).requires_grad_()
# Boolean mask of shape (16, 8): only the first key position is True
mask = torch.tensor([[True] + [False]*7] * 16, dtype=torch.bool).cuda(1)
out = model(q, k, v, att_mask=mask)
breakpoint()
Environment
latest xformers built from source
I'm not sure about this error. @fmassa, do you know who would be the right POC here? Also, do you have a stack trace for the error, @1049451037?
I am also getting this error
Is it possible to disable the triton softmax kernel as a temporary workaround?
Same problem here. How bad is this bug, i.e. can I simply ignore the "Triton Error"?
I think Triton launches kernels on the current CUDA device, and you usually want the tensors you pass it to be on that device. This means you might need to set the current device manually, unlike most simple PyTorch operations, which run on the device of their inputs.
So in the code above, you could maybe try replacing
out = model(q, k, v, att_mask=mask)
with
with torch.cuda.device(q.device):
    out = model(q, k, v, att_mask=mask)
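If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch of the workaround suggested above, not something confirmed to fix the error: `input_device_guard` and `call_on_input_device` are made-up names (not part of xformers or Triton), and it assumes every input already lives on the same device as `q`.

```python
import contextlib


def input_device_guard(tensor):
    """Return a context manager that makes `tensor`'s GPU the current CUDA device.

    Hypothetical helper, not part of xformers. For CPU tensors it returns a
    no-op context manager, so the same call site works on CPU and GPU.
    """
    if getattr(tensor, "is_cuda", False):
        import torch  # only needed on the CUDA path

        # torch.cuda.device accepts a torch.device or an int index and
        # restores the previously selected device on exit.
        return torch.cuda.device(tensor.device)
    return contextlib.nullcontext()


def call_on_input_device(module, q, k, v, **kwargs):
    """Call `module(q, k, v, **kwargs)` with q's device set as the current one."""
    with input_device_guard(q):
        return module(q, k, v, **kwargs)
```

With the repro above, the call would become `out = call_on_input_device(model, q, k, v, att_mask=mask)`. Calling `torch.cuda.set_device(1)` once at startup might also work, but the context manager keeps the change local to the call.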