Implement `std::format` and `std::print`

Open kerrmudgeon opened this issue 5 years ago • 10 comments

printf() is available in CUDA, but it has several deficiencies.

  • The data type specifiers in the format string must agree with the types of the variadic arguments (a DRY failure).
  • printf() cannot be overloaded for user-defined types.
  • Printing across multiple statements has undefined interleaving with output from other threads, limiting composability.

This request is to add a printing facility, similar to std::ostream, that is accessible in device code. It should enable user-defined printing functions and provide type safety. Ideally, delimiter tokens analogous to std::flush and std::endl would let the CUDA driver interleave the output from CUDA threads without corruption.

Example usage:

struct Foo {
  int member;
};

struct Bar {
  Foo foo;
  char const *name;
};

// Prints an object of type Foo.
__device__ cu::ostream &operator<<(cu::ostream &out, Foo const &foo) {
  return out << foo.member;
}

// Prints an object of type Bar.
__device__ cu::ostream &operator<<(cu::ostream &out, Bar const &bar) {
  return out << bar.foo << " " << bar.name;
}

// Sample usage
__global__ void kernel() {

  Bar bar;

  cu::cout << "Thread " << threadIdx.x << ": " << bar << cu::endl;

  __syncthreads();

  cu::cout << "This is a multistatement ";
  cu::cout << "output block. ";
  cu::cout << "The index of the current thread is " << threadIdx.x;

  if (threadIdx.x & 1) {
    cu::cout << ", and it contains a bar of value " << bar;
  }

  cu::cout << ". The output appears contiguous despite control flow and printing over multiple statements.\n";

  cu::cout << cu::flush;
}

// Generic kernel template that prints any object. Requires `operator<<(cu::ostream &, T const &rhs)` to exist in scope.
template <typename T>
__global__ void generic_print(T *objects) {
  cu::cout << "T" << threadIdx.x << ": " << objects[threadIdx.x] << cu::endl;
}

kerrmudgeon avatar Sep 21 '20 03:09 kerrmudgeon

Is there any update on this? std::cout in device code would be a sweet feature to facilitate debugging.

donglinz avatar Jul 30 '21 07:07 donglinz

Is there any interest in a sweet feature like this?

Note: std::format or similar would be a perfectly viable alternative. Even a non-standard solution enabling seemingly atomic printing would be welcome.

Thanks for any consideration.

kerrmudgeon avatar Oct 14 '21 20:10 kerrmudgeon

Definitely interest from another user here for this sweet feature 😄

maddyscientist avatar Oct 14 '21 20:10 maddyscientist

It's on our roadmap, we just need to prioritize it. ;) The more people that ask, the easier it would be for us to bump it into the next couple releases.

wmaxey avatar Oct 14 '21 21:10 wmaxey

I think std::format is going to be much more feasible than std::cout.

jrhemstad avatar Oct 14 '21 22:10 jrhemstad

@wmaxey I third this. Most C++ beginners prefer using std::cout and std::cerr to std::format IMHO. The latter is kinda Python-like (or is it the other way around?)

wyphan avatar Jan 26 '22 11:01 wyphan

I also sincerely hope that std::cout can be used in a CUDA kernel function.

songyuc avatar Nov 29 '22 11:11 songyuc

Hijacking this issue to be about format and std::print now. cout isn't going to be feasible any time soon.
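
For reference, this is the host-side C++20/23 facility the issue would mirror in device code; nothing below is a CUDA API, it only illustrates the type safety and user-defined-type support that printf lacks:

```c++
// Host-side C++20/23 shown only for reference: the type-safe, user-extensible
// API this issue asks to mirror in device code. Nothing here is a CUDA API.
#include <format>
#include <print>   // std::print, C++23
#include <string>

struct Foo { int member; };

// User-defined types opt in by specializing std::formatter -- the extension
// point printf() lacks.
template <>
struct std::formatter<Foo> {
  constexpr auto parse(std::format_parse_context &ctx) { return ctx.begin(); }

  template <class FormatContext>
  auto format(Foo const &foo, FormatContext &ctx) const {
    return std::format_to(ctx.out(), "Foo{{{}}}", foo.member);
  }
};

int main() {
  Foo foo{42};
  // Argument types are deduced and the format string is checked at compile
  // time, so printf()'s specifier/argument mismatches cannot happen.
  std::print("thread {}: {}\n", 7, foo);
  std::string line = std::format("formatted once, printed later: {}\n", foo);
  std::print("{}", line);
}
```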

jrhemstad avatar Feb 23 '23 16:02 jrhemstad

It would be great if libcudacxx offered a solution for device-side logging, which this request would easily enable, instead of each team rolling its own.

We're interested in this too, as we prefer logging over issuing __trap when something wrong but non-fatal is detected in a kernel.

leofang avatar May 24 '23 15:05 leofang

In Spiral, I've created my own printing functionality that is type-safe and can print arbitrary types in the language. It also uses a global semaphore to ensure that only one thread is sending data to the terminal at a time. Furthermore, the functions can be used interchangeably on both the Python and the CUDA side.
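
The idea is roughly a device-global spinlock around printf. A minimal sketch (not Spiral's actual code; it assumes independent thread scheduling, i.e. Volta or newer, so the spin loop cannot deadlock within a warp):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single device-global lock serializing all printing threads.
__device__ int print_lock = 0;

__global__ void locked_print_kernel() {
  // Acquire: spin until the CAS succeeds. Safe on Volta+ thanks to
  // independent thread scheduling; older architectures can deadlock here.
  while (atomicCAS(&print_lock, 0, 1) != 0) { }

  // Everything printed while the lock is held forms one contiguous block,
  // because no other thread can enqueue printf records in between.
  printf("Thread %d: this multi-statement ", threadIdx.x);
  printf("output stays together.\n");

  __threadfence();              // make prior writes visible before the release
  atomicExch(&print_lock, 0);   // release
}

int main() {
  locked_print_kernel<<<2, 4>>>();
  cudaDeviceSynchronize();      // flush the device printf buffer
  return 0;
}
```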

The difficulty I am having is that there is no way to do I/O redirection on the Python side. It only ever shows the data in the terminal. In fact, I asked about this on the CUDA developer support page, and this is the reply I got from Yuki Ni.

Here is the answer and suggestion from our Python engineering team: CUDA does not offer a way to redirect stdout from the device side; printf works as is. There have been sporadic conversations with the CCCL team on adding device-side logging support, so please speak up there to gain attention: https://github.com/NVIDIA/cccl/issues/939

If a status report (e.g. a progress bar) needs to happen periodically from the device side, I believe an alternative approach is to do an in-kernel atomic write to host-pinned memory and have an independent host thread poll the value written by the device. Then the host code has full control over logging and I/O stream redirection.

More than just sending text, I wish CUDA had support for channels so that we could send arbitrary data to the host without having to terminate the kernel.
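
For what it's worth, a minimal sketch of the pinned-memory polling approach from that reply might look like the following (the kernel and all names are made up for illustration, and it assumes compute capability 6.0 or newer for atomicAdd_system):

```cuda
// Sketch of the suggested workaround: the kernel publishes a counter into
// host-pinned memory and a host loop polls it, so all actual I/O (and any
// redirection) happens on the host. Illustrative only, not a library API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void worker(int *progress, int iters) {
  for (int i = 0; i < iters; ++i) {
    // ... real work would go here ...
    if (threadIdx.x == 0 && blockIdx.x == 0)
      atomicAdd_system(progress, 1);   // system scope: visible to the host
  }
}

int main() {
  // Pinned, mapped allocation: written by the device, read directly by the host.
  int *progress = nullptr;
  cudaHostAlloc((void **)&progress, sizeof(int), cudaHostAllocMapped);
  *progress = 0;

  int *d_progress = nullptr;
  cudaHostGetDevicePointer((void **)&d_progress, progress, 0);

  const int iters = 1 << 20;
  worker<<<1, 32>>>(d_progress, iters);

  // Poll until the kernel finishes; stdout is ordinary host stdout here, so
  // Python (or anything else) can redirect it as usual.
  while (cudaStreamQuery(0) == cudaErrorNotReady)
    std::printf("progress: %d / %d\r", *(volatile int *)progress, iters);

  cudaDeviceSynchronize();
  std::printf("\ndone: %d\n", *progress);
  return 0;
}
```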

mrakgr avatar Aug 29 '24 12:08 mrakgr
