
ScaledDotProduct cannot run on cuda:1

Open 1049451037 opened this issue 1 year ago • 5 comments

🐛 Bug

When I run on cuda:0, everything works fine. But when I run on cuda:1, the following error occurs:

Triton softmax kernel register spillover or invalid image caught. Deactivating this kernel, please file an issue int the xFormers repository
Triton Error [CUDA]: context is destroyed


To Reproduce

import torch

from xformers.components.attention import ScaledDotProduct

model = ScaledDotProduct().cuda(1)

q = torch.randn(16, 16, 64).cuda(1).requires_grad_()
k = torch.randn(16, 8, 64).cuda(1).requires_grad_()
v = torch.randn(16, 8, 64).cuda(1).requires_grad_()
mask = torch.tensor([[True] + [False] * 7] * 16, dtype=torch.bool).cuda(1)

out = model(q, k, v, att_mask=mask)
breakpoint()

Environment

latest xformers built from source

1049451037 avatar Mar 06 '23 12:03 1049451037

I'm not sure about this error. Maybe @fmassa knows who would be the right POC there? Also, do you have a stack trace for the error, @1049451037?

danthe3rd avatar Mar 06 '23 16:03 danthe3rd

I am also getting this error

pmcvay avatar Jul 12 '23 15:07 pmcvay

Is it possible to disable the triton softmax kernel as a temporary workaround?

pmcvay avatar Jul 12 '23 20:07 pmcvay

Same problem here. How bad is this bug, i.e. can I simply ignore the "Triton Error"?

vladchimescu avatar Feb 13 '24 16:02 vladchimescu

I think Triton launches kernels on the current CUDA device, and you usually want the tensors you pass in to be on that device. This means you may need to set that device manually, unlike most simple PyTorch operations, which run on the device of their inputs.

So in the code above, you could maybe try replacing

out = model(q, k, v, att_mask=mask)

with

with torch.cuda.device(q.device):
    out = model(q, k, v, att_mask=mask)

bottler avatar Feb 13 '24 17:02 bottler
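For anyone wanting to sanity-check results while the Triton kernel is misbehaving: below is a minimal CPU-only sketch (plain PyTorch, not xFormers) of the masked scaled dot-product attention the repro computes, softmax(q kᵀ / √d) v with a boolean mask where True means "attend". The shapes mirror the repro above; the comparison against torch.nn.functional.scaled_dot_product_attention assumes PyTorch >= 2.0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Same shapes as the repro: (batch, L, d) query, (batch, S, d) key/value.
q = torch.randn(16, 16, 64)
k = torch.randn(16, 8, 64)
v = torch.randn(16, 8, 64)
# (L, S) boolean mask, True = position may be attended to.
mask = torch.tensor([[True] + [False] * 7] * 16, dtype=torch.bool)

# Manual masked attention: scale scores, mask with -inf, softmax, weight values.
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)    # (batch, L, S)
scores = scores.masked_fill(~mask, float("-inf"))
out_manual = torch.softmax(scores, dim=-1) @ v    # (batch, L, d)

# Built-in reference; the (L, S) mask broadcasts across the batch dimension.
out_ref = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

assert torch.allclose(out_manual, out_ref, atol=1e-5)
```

This runs entirely on CPU, so it can serve as a ground truth to compare against the xFormers output on either GPU once the kernel issue is worked around.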