
[FEA]: the BinaryFunction in thrust::reduce must be defined outside the main function?

Open qqwqqw689 opened this issue 7 months ago • 1 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

I found the code

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <cstdio>  // for printf

int main() {
    const int N = 100;
    
    // Allocate and initialize device memory
    float3* d_data;
    cudaMalloc(&d_data, N * sizeof(float3));
    
    // Create Thrust device pointer
    thrust::device_ptr<float3> d_ptr(d_data);
    
    // Initialize data (example: simple pattern)
    for(int i = 0; i < N; i++) {
        float3 val = make_float3(i, i*2, i*3);
        cudaMemcpy(&d_data[i], &val, sizeof(float3), cudaMemcpyHostToDevice);
    }
    struct float3_add {
        __host__ __device__
        float3 operator()(const float3& a, const float3& b) const {
            return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
        }
    };
    // Perform reduction
    float3 sum = thrust::reduce(d_ptr, d_ptr + N, make_float3(0,0,0), float3_add());
    
    // Print result (automatically copied to host)
    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);
    
    cudaFree(d_data);
    return 0;
}

doesn't work correctly. But

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <cstdio>  // for printf

struct float3_add {
    __host__ __device__
    float3 operator()(const float3& a, const float3& b) const {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
};

int main() {
    const int N = 100;
    
    // Allocate and initialize device memory
    float3* d_data;
    cudaMalloc(&d_data, N * sizeof(float3));
    
    // Create Thrust device pointer
    thrust::device_ptr<float3> d_ptr(d_data);
    
    // Initialize data (example: simple pattern)
    for(int i = 0; i < N; i++) {
        float3 val = make_float3(i, i*2, i*3);
        cudaMemcpy(&d_data[i], &val, sizeof(float3), cudaMemcpyHostToDevice);
    }

    // Perform reduction
    float3 sum = thrust::reduce(d_ptr, d_ptr + N, make_float3(0,0,0), float3_add());
    
    // Print result (automatically copied to host)
    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);
    
    cudaFree(d_data);
    return 0;
}

works correctly.

Describe the solution you'd like

Why can't I define float3_add inside the main function?

Describe alternatives you've considered

No response

Additional context

No response

qqwqqw689 avatar Apr 27 '25 00:04 qqwqqw689

I believe the problem is that the local type is defined in a __host__-only function and is therefore not visible during __device__ compilation. I've slightly modified the provided example here: https://godbolt.org/z/xsxq9Y9cz

You can see that nvcc emits warnings about using local types in __global__ functions, which is what prevents the kernel from being launched (ignore the warnings about using cudaMalloc, cudaFree and cudaMemcpy in a __device__ function).

However, if you use an extended lambda instead of a callable type, the code works fine, see: https://godbolt.org/z/1ff5qKPsP
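For reference, here is a minimal sketch of what the extended-lambda variant can look like (this is an illustrative rewrite of the original example, not the exact Godbolt code; it assumes nvcc is invoked with the --extended-lambda flag):

```cuda
// Sketch: using an extended lambda defined inside main instead of a
// local functor type. Compile with: nvcc --extended-lambda example.cu
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector_types.h>
#include <cstdio>

int main() {
    const int N = 100;
    // Fill device memory with a simple constant pattern for illustration
    thrust::device_vector<float3> d_vec(N, make_float3(1.f, 2.f, 3.f));

    // An extended (__host__ __device__) lambda gets a compiler-generated,
    // non-local closure type, so it is visible to the device pass even
    // though it is written inside main.
    auto float3_add = [] __host__ __device__ (const float3& a, const float3& b) {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    };

    float3 sum = thrust::reduce(d_vec.begin(), d_vec.end(),
                                make_float3(0, 0, 0), float3_add);
    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);
    return 0;
}
```

Note that extended lambdas come with their own restrictions (for example, they cannot be used in generic contexts that require capturing certain local state), so for reusable operations a namespace-scope functor is still the more conventional choice.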

davebayer avatar Apr 27 '25 04:04 davebayer

@davebayer is correct here: you need to ensure that the functor type is visible during the device compilation pass.

miscco avatar Apr 28 '25 07:04 miscco

@miscco @davebayer thanks!

qqwqqw689 avatar Apr 28 '25 07:04 qqwqqw689