Tensor on cuda device 1 cannot be accessed from Triton (cpu tensor?)
The softmax code below is copied from the tutorials to demonstrate that we cannot pass tensors on devices other than "cuda:0" to a Triton kernel.
The error is:
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(
    output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols,
    BLOCK_SIZE: tl.constexpr
):
    # The rows of the softmax are independent, so we parallelize across those
    row_idx = tl.program_id(0)
    # The stride represents how much we need to increase the pointer to advance 1 row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # The block size is the next power of two greater than n_cols, so we can fit each
    # row in a single block
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    # Load the row into SRAM, using a mask since BLOCK_SIZE may be > than n_cols
    row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))
    # Subtract maximum for numerical stability
    row_minus_max = row - tl.max(row, axis=0)
    # Note that exponentiation in Triton is fast but approximate (i.e., think __expf in CUDA)
    numerator = tl.exp(row_minus_max)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator
    # Write back output to DRAM
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, softmax_output, mask=col_offsets < n_cols)


def softmax(x):
    n_rows, n_cols = x.shape
    # The block size is the smallest power of two greater than the number of columns in `x`
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Another trick we can use is to ask the compiler to use more threads per row by
    # increasing the number of warps (`num_warps`) over which each row is distributed.
    # You will see in the next tutorial how to auto-tune this value in a more natural
    # way so you don't have to come up with manual heuristics yourself.
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    # Allocate output
    y = torch.empty_like(x)
    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row
    # of the input matrix
    softmax_kernel[(n_rows,)](
        y,
        x,
        x.stride(0),
        y.stride(0),
        n_cols,
        num_warps=num_warps,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y


x = torch.randn(128, 6800, device="cuda:1")
out = softmax(x)
But the input tensor is not a CPU tensor.
What worries me more is that I have tried using Triton kernels with accelerate (https://pypi.org/project/accelerate/). In that setting there are cases where tensors on "cuda:1" (or other non-zero devices) are used as inputs to Triton kernels. Even though it ran (I don't know why it could run), the result was wrong, or rather undefined behavior: sometimes it was right and sometimes it was totally wrong, and the same happened on other kernels, like a flash attention kernel. I confirmed this by dumping the inputs and outputs, then loading the tensors and running the kernel again on "cuda:0"; the new results were different from the saved ones.
Is there some necessary context-initialization step before tensors on "cuda:1" can be passed to Triton kernels?
We found that the error is raised from getPointer (a function defined in a string template) in python/triton/compiler/make_launcher.py:
int status = cuPointerGetAttribute(&dev_ptr, CU_POINTER_ATTRIBUTE_DEVICE_POINTER, ptr_info.dev_ptr);
if (status == CUDA_ERROR_INVALID_VALUE) {{
    PyErr_Format(PyExc_ValueError,
                 "Pointer argument (at %d) cannot be accessed from Triton (cpu tensor?)", idx);
    ptr_info.valid = false;
}}
The error code CUDA_ERROR_INVALID_VALUE returned by cuPointerGetAttribute with the attribute CU_POINTER_ATTRIBUTE_DEVICE_POINTER indicates that "there exists no device pointer value through which kernels running in the current CUcontext may access ptr".
And I guess that while working with accelerate, there happens to be a valid device pointer on "cuda:0", so the kernel runs, but with wrong results.
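For illustration, here is a minimal sanity check of my own (not part of Triton; the helper name is made up) that expresses the assumption the generated launcher makes, namely that every pointer argument is accessible from the currently active device's context:

import torch

def check_launchable(*tensors):
    # Triton's launcher resolves pointer arguments against the *current* CUDA
    # context, so every CUDA tensor passed to a kernel should live on the device
    # reported by torch.cuda.current_device() at launch time.
    current = torch.cuda.current_device()
    for t in tensors:
        if t.is_cuda and t.device.index != current:
            raise ValueError(
                f"tensor on {t.device}, but the current CUDA device is cuda:{current}"
            )

x = torch.randn(128, 6800, device="cuda:1")
y = torch.empty_like(x)
check_launchable(y, x)  # raises unless the current device has been switched to cuda:1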
As a workaround, we can use torch.cuda.set_device to switch the device context. Then the kernel runs as expected.
x = torch.randn(128, 6800, device="cuda:1")
torch.cuda.set_device(x.device.index)  # switch context
out = softmax(x)
torch.cuda.set_device(0)  # switch back
But inferring the correct device to run the kernel on from the inputs would be more intuitive. After all, torch.softmax does not require manually switching devices.
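A minimal sketch of that idea, reusing the tutorial's softmax_kernel and the set_device workaround from above (the save/restore around the launch is my addition, not tutorial code), moves the switch inside the wrapper so the call site never has to care:

def softmax(x):
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    y = torch.empty_like(x)
    # Switch to the device that holds `x` only for the launch, then restore the
    # previous device so the caller's context is left untouched (like torch.softmax).
    prev_device = torch.cuda.current_device()
    torch.cuda.set_device(x.device.index)
    try:
        softmax_kernel[(n_rows,)](
            y, x, x.stride(0), y.stride(0), n_cols,
            num_warps=num_warps, BLOCK_SIZE=BLOCK_SIZE,
        )
    finally:
        torch.cuda.set_device(prev_device)
    return y

out = softmax(torch.randn(128, 6800, device="cuda:1"))  # no manual switching needed here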
+1, this is affecting me as well. It seems to work normally on cuda:0 but is stochastic on cuda != 0. ~Also your torch.cuda.set_device fix does not work for me.~ (It was actually failing on a different kernel; I just had to wrap that one as well.)
Thank you for the fix @iclementine. For anyone looking for a recipe, you just need to wrap your Triton kernel launches in:
with torch.cuda.device(x.device):
    out = kernel[(..)](...)
Would be great to have a section on this in the docs since this issue is really undiscoverable.
Yes, using the context manager is better for this.
I found another workaround. I was using whisper on my second GPU and ran into the same problem when using the parameter --device cuda:1.
My solution was to use an environment variable to select the right device:
CUDA_VISIBLE_DEVICES=1 whisper --device cuda
This works correctly: with CUDA_VISIBLE_DEVICES=1, the second GPU is the only device visible to the process and is addressed as cuda:0, so the default context matches the tensors.
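The remapping can be checked from Python (a small sketch; it assumes the variable is set before CUDA is initialized in the process):

import os
# Must be set before the first CUDA call in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# Only the second physical GPU is visible, and it is addressed as cuda:0,
# so tensors land in the context that Triton's launcher uses by default.
print(torch.cuda.device_count())  # 1
x = torch.randn(4, device="cuda")
print(x.device)                   # cuda:0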