Tensor on cuda device 1 cannot be accessed from Triton (cpu tensor?)
The softmax code below is copied from the tutorials to demonstrate that we cannot pass tensors on devices other than "cuda:0" to a Triton kernel.
The error is:
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(
    output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols,
    BLOCK_SIZE: tl.constexpr
):
    # The rows of the softmax are independent, so we parallelize across those
    row_idx = tl.program_id(0)
    # The stride represents how much we need to increase the pointer to advance 1 row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # The block size is the next power of two greater than n_cols, so we can fit each
    # row in a single block
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    # Load the row into SRAM, using a mask since BLOCK_SIZE may be > than n_cols
    row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))
    # Subtract maximum for numerical stability
    row_minus_max = row - tl.max(row, axis=0)
    # Note that exponentiation in Triton is fast but approximate (i.e., think __expf in CUDA)
    numerator = tl.exp(row_minus_max)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator
    # Write back output to DRAM
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, softmax_output, mask=col_offsets < n_cols)


def softmax(x):
    n_rows, n_cols = x.shape
    # The block size is the smallest power of two greater than the number of columns in `x`
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Another trick we can use is to ask the compiler to use more threads per row by
    # increasing the number of warps (`num_warps`) over which each row is distributed.
    # You will see in the next tutorial how to auto-tune this value in a more natural
    # way so you don't have to come up with manual heuristics yourself.
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    # Allocate output
    y = torch.empty_like(x)
    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row
    # of the input matrix
    softmax_kernel[(n_rows,)](
        y,
        x,
        x.stride(0),
        y.stride(0),
        n_cols,
        num_warps=num_warps,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y


x = torch.randn(128, 6800, device="cuda:1")
out = softmax(x)
But the input tensor is not a CPU tensor.
What worries me more is that I have tried using Triton kernels with accelerate (https://pypi.org/project/accelerate/). In that setting there are cases where tensors on "cuda:1" (or other non-zero devices) are used as inputs to Triton kernels. Even though it ran (I don't know why it could run), the result was wrong, or rather undefined behavior: sometimes it was right and sometimes it was totally wrong, and the same happened on other kernels, like a flash attention kernel. I confirmed this by dumping the inputs and outputs, then loading the tensors and running the kernel again on "cuda:0"; the new results were different from the saved ones.
Is there some necessary context-initialization step before tensors on "cuda:1" can be passed to Triton kernels?
We found that the error is raised from getPointer (a function defined in a string template) in python/triton/compiler/make_launcher.py:
int status = cuPointerGetAttribute(&dev_ptr, CU_POINTER_ATTRIBUTE_DEVICE_POINTER, ptr_info.dev_ptr);
if (status == CUDA_ERROR_INVALID_VALUE) {{
    PyErr_Format(PyExc_ValueError,
                 "Pointer argument (at %d) cannot be accessed from Triton (cpu tensor?)", idx);
    ptr_info.valid = false;
}}
The error code CUDA_ERROR_INVALID_VALUE returned by cuPointerGetAttribute with the attribute CU_POINTER_ATTRIBUTE_DEVICE_POINTER indicates that "there exists no device pointer value through which kernels running in the current CUcontext may access ptr".
And I guess that while working with accelerate, there happens to be a valid device pointer on "cuda:0", so the kernel runs, but with wrong results.
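For illustration, here is a minimal sanity check of my own (not part of Triton; the helper name is made up) that expresses the assumption the generated launcher makes, namely that every pointer argument is accessible from the currently active device's context:

import torch

def check_launchable(*tensors):
    # Triton's launcher resolves pointer arguments against the *current* CUDA
    # context, so every CUDA tensor passed to a kernel should live on the device
    # reported by torch.cuda.current_device() at launch time.
    current = torch.cuda.current_device()
    for t in tensors:
        if t.is_cuda and t.device.index != current:
            raise ValueError(
                f"tensor on {t.device}, but the current CUDA device is cuda:{current}"
            )

x = torch.randn(128, 6800, device="cuda:1")
y = torch.empty_like(x)
check_launchable(y, x)  # raises unless the current device has been switched to cuda:1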
As a workaround, we can use torch.cuda.set_device to switch the device context. Then the kernel runs as expected.
x = torch.randn(128, 6800, device="cuda:1")
torch.cuda.set_device(x.device.index)  # switch context
out = softmax(x)
torch.cuda.set_device(0)  # switch back
But inferring the correct device to run the kernel on from the inputs would be more intuitive. After all, torch.softmax does not require manually switching devices.
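A minimal sketch of that idea, reusing the tutorial's softmax_kernel and the set_device workaround from above (the save/restore around the launch is my addition, not tutorial code), moves the switch inside the wrapper so the call site never has to care:

def softmax(x):
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    y = torch.empty_like(x)
    # Switch to the device that holds `x` only for the launch, then restore the
    # previous device so the caller's context is left untouched (like torch.softmax).
    prev_device = torch.cuda.current_device()
    torch.cuda.set_device(x.device.index)
    try:
        softmax_kernel[(n_rows,)](
            y, x, x.stride(0), y.stride(0), n_cols,
            num_warps=num_warps, BLOCK_SIZE=BLOCK_SIZE,
        )
    finally:
        torch.cuda.set_device(prev_device)
    return y

out = softmax(torch.randn(128, 6800, device="cuda:1"))  # no manual switching needed here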
+1, this is affecting me as well. It seems to work normally on cuda:0 but is stochastic on cuda != 0. ~Also your torch.cuda.set_device fix does not work for me.~ (It was actually failing on a different kernel; I just had to wrap that one as well.)
Thank you for the fix @iclementine. For anyone looking for a recipe, you just need to wrap your Triton kernel launches in:
with torch.cuda.device(x.device):
    out = kernel[(..)](...)
Would be great to have a section on this in the docs since this issue is really undiscoverable.
Yes, using the context manager is better for this.
I found another workaround. I was using whisper on my second GPU and ran into the same problem when using the parameter --device cuda:1.
My solution was to use an environment variable to select the right device:
CUDA_VISIBLE_DEVICES=1 whisper --device cuda
This works correctly: with CUDA_VISIBLE_DEVICES=1, the second GPU is the only device visible to the process and is addressed as cuda:0, so the default context matches the tensors.
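The remapping can be checked from Python (a small sketch; it assumes the variable is set before CUDA is initialized in the process):

import os
# Must be set before the first CUDA call in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# Only the second physical GPU is visible, and it is addressed as cuda:0,
# so tensors land in the context that Triton's launcher uses by default.
print(torch.cuda.device_count())  # 1
x = torch.randn(4, device="cuda")
print(x.device)                   # cuda:0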