[FEA]: the BinaryFunction in thrust::reduce must be defined outside the main function?
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
Thrust
Is your feature request related to a problem? Please describe.
I found that the following code
```cpp
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <cstdio>

int main() {
    const int N = 100;

    // Allocate and initialize device memory
    float3* d_data;
    cudaMalloc(&d_data, N * sizeof(float3));

    // Create Thrust device pointer
    thrust::device_ptr<float3> d_ptr(d_data);

    // Initialize data (example: simple pattern)
    for (int i = 0; i < N; i++) {
        float3 val = make_float3(i, i * 2, i * 3);
        cudaMemcpy(&d_data[i], &val, sizeof(float3), cudaMemcpyHostToDevice);
    }

    // Binary functor defined inside main
    struct float3_add {
        __host__ __device__
        float3 operator()(const float3& a, const float3& b) const {
            return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
        }
    };

    // Perform reduction
    float3 sum = thrust::reduce(d_ptr, d_ptr + N, make_float3(0, 0, 0), float3_add());

    // Print result (automatically copied to host)
    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);

    cudaFree(d_data);
    return 0;
}
```
does not work correctly. But the following version
```cpp
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <cstdio>

// Binary functor defined at namespace scope
struct float3_add {
    __host__ __device__
    float3 operator()(const float3& a, const float3& b) const {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
};

int main() {
    const int N = 100;

    // Allocate and initialize device memory
    float3* d_data;
    cudaMalloc(&d_data, N * sizeof(float3));

    // Create Thrust device pointer
    thrust::device_ptr<float3> d_ptr(d_data);

    // Initialize data (example: simple pattern)
    for (int i = 0; i < N; i++) {
        float3 val = make_float3(i, i * 2, i * 3);
        cudaMemcpy(&d_data[i], &val, sizeof(float3), cudaMemcpyHostToDevice);
    }

    // Perform reduction
    float3 sum = thrust::reduce(d_ptr, d_ptr + N, make_float3(0, 0, 0), float3_add());

    // Print result (automatically copied to host)
    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);

    cudaFree(d_data);
    return 0;
}
```
works correctly.
Describe the solution you'd like
Why can't I define `float3_add` inside the main function?
Describe alternatives you've considered
No response
Additional context
No response
I believe the problem is that the local type is defined in a `__host__`-only function and is not visible during `__device__` compilation. I've slightly modified the provided example here: https://godbolt.org/z/xsxq9Y9cz
You can see that nvcc emits warnings about using local types in `__global__` functions, which is why the kernel cannot be launched (ignore the warnings about using cudaMalloc, cudaFree and cudaMemcpy in a `__device__` function).
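The restriction is reproducible without Thrust. Here is a minimal, hypothetical sketch (the kernel and names below are made up for illustration, not taken from the linked example): instantiating a `__global__` function template with a type defined inside `main` draws the same class of nvcc warning, and the launch then fails in the same way as the Thrust case.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Trivial __global__ function template; Op will be instantiated with a type
// defined inside main, which is what nvcc warns about.
template <typename Op>
__global__ void apply(float* x, Op op) {
    *x = op(*x, 1.0f);
}

int main() {
    // Local functor type: only visible where main is compiled, not to the device pass.
    struct add_op {
        __host__ __device__ float operator()(float a, float b) const { return a + b; }
    };

    float* d_x;
    cudaMalloc(&d_x, sizeof(float));
    cudaMemset(d_x, 0, sizeof(float));

    // nvcc warns about the local type used as the __global__ template argument,
    // and the kernel does not actually launch.
    apply<<<1, 1>>>(d_x, add_op{});
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_x);
    return 0;
}
```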
However, if you use an extended lambda instead of a callable type defined in `main`, the code works fine, see: https://godbolt.org/z/1ff5qKPsP
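For completeness, here is a minimal sketch of that extended-lambda variant (assuming nvcc is invoked with `--extended-lambda`; this is a reconstruction of the idea, not the exact code behind the link):

```cpp
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <vector_types.h>
#include <cstdio>

int main() {
    const int N = 100;

    float3* d_data;
    cudaMalloc(&d_data, N * sizeof(float3));
    thrust::device_ptr<float3> d_ptr(d_data);

    for (int i = 0; i < N; i++) {
        float3 val = make_float3(i, i * 2, i * 3);
        cudaMemcpy(&d_data[i], &val, sizeof(float3), cudaMemcpyHostToDevice);
    }

    // Extended lambda in place of the local functor type; it may be defined
    // inside main because nvcc treats extended lambdas specially.
    auto float3_add = [] __host__ __device__ (const float3& a, const float3& b) {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    };

    float3 sum = thrust::reduce(d_ptr, d_ptr + N, make_float3(0, 0, 0), float3_add);

    printf("Sum: (%f, %f, %f)\n", sum.x, sum.y, sum.z);
    cudaFree(d_data);
    return 0;
}
```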
@davebayer is correct here; you need to ensure that the functor is visible to the device compilation pass.
@miscco @davebayer thanks!