cv::cuda::transpose crashes when CV_32F rows exceed 1048560
System Information
OpenCV: 4.10.0
Compiler: clang 17.0.6
Platform: AlmaLinux 9
CUDA SDK: 12.3
Detailed description
Exception message: OpenCV(4.10.0) opencv-4.10.0/contrib/modules/cudev/include/opencv2/cudev/grid/detail/transpose.hpp:118: error: (-217:Gpu API call) invalid configuration argument in function 'transpose'
Steps to reproduce
#include <iostream>

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

int main()
{
    for (int i = 1048555; i <= 1048561; ++i)
    {
        cv::Mat src(i, 4, CV_32F, cv::Scalar{2});
        cv::Mat dst;
        cv::RNG rng{};
        rng.fill(src, cv::RNG::UNIFORM, 0, 200);

        // CPU works
        cv::transpose(src, dst);

        // GPU: throws once rows exceed 1048560
        cv::cuda::GpuMat d_src(src);
        cv::cuda::GpuMat d_dst;
        cv::cuda::GpuMat d_dst2{src.cols, src.rows, src.type(), cv::Scalar{10}};
        std::cout << cv::cuda::sum(d_src) << " | " << cv::cuda::sum(d_dst2) << std::endl;
        cv::cuda::transpose(d_src, d_dst);
        cv::cuda::transpose(d_src, d_dst2);
        std::cout << cv::cuda::sum(d_src) << " | " << cv::cuda::sum(d_dst)
                  << " | " << cv::cuda::sum(d_dst2) << std::endl;

        // Check results
        bool passed  = cv::norm(dst - cv::Mat(d_dst),  cv::NORM_INF) < 1e-3;
        bool passed2 = cv::norm(dst - cv::Mat(d_dst2), cv::NORM_INF) < 1e-3;
        std::cout << "i=" << i << " dst without memory initialized: " << (passed  ? "passed" : "FAILED") << std::endl;
        std::cout << "i=" << i << " dst with memory initialized: "    << (passed2 ? "passed" : "FAILED") << std::endl;

        // Deallocate data here, otherwise deallocation will be performed
        // after the context is extracted from the stack
        d_src.release();
        d_dst.release();
        d_dst2.release();
        std::cout << "released\n";
    }
    return 0;
}
Issue submission checklist
- [X] I report the issue, it's not a question
- [X] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- [X] I updated to the latest OpenCV version and the issue is still there
- [X] There is reproducer code and related data files (videos, images, onnx, etc)
Your error is unrelated to cv::cuda::transpose and is caused by cv::cuda::GpuMat::setTo. That said, if you don't initialize d_dst2 to 10, i.e.
cv::cuda::GpuMat d_dst2{src.cols, src.rows, src.type()};
then you would receive the same error from the transpose operation.
Both errors are caused by the same underlying issue: the CUDA kernels are launched with a y grid dimension greater than the maximum allowed by the hardware, 65535.
setTo has a y block dimension of 8, meaning it fails with more than 8 * 65535 = 524280 rows, and transpose has a y block dimension of 16, meaning it fails with more than 16 * 65535 = 1048560 rows.
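For reference, the limit can be confirmed on your device with the standard CUDA runtime query (a standalone sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // maxGridSize[1] is the maximum y grid dimension (65535 on current hardware)
    std::printf("max grid y: %d\n", prop.maxGridSize[1]);
    return 0;
}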
A possible solution here would be to use thread coarsening inside the corresponding cudev kernels to reduce the number of blocks launched in the y direction.
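To illustrate, this is not the cudev code, just a minimal sketch of thread coarsening for an element-wise kernel such as setTo (coarsen_y is a made-up parameter):

// Each thread writes coarsen_y consecutive rows instead of one, so the
// y grid dimension shrinks by the same factor and stays under 65535.
template <int coarsen_y>
__global__ void setToCoarsened(float* data, size_t step, int rows, int cols, float val)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = (blockIdx.y * blockDim.y + threadIdx.y) * coarsen_y;
    if (x >= cols)
        return;

    for (int i = 0; i < coarsen_y && y < rows; ++i, ++y)
    {
        float* row = (float*)((char*)data + (size_t)y * step);
        row[x] = val;
    }
}

// Launch with the y grid shrunk by the coarsening factor:
//   dim3 block(32, 8);
//   dim3 grid((cols + block.x - 1) / block.x,
//             (rows + block.y * coarsen_y - 1) / (block.y * coarsen_y));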
Hmm, ok. If this is the case, what do you suggest we use for an HWC to CHW permute? I guess we have to implement something ourselves then. We could of course break it down into smaller ROIs and transpose them one by one, or on different streams (see the sketch after the policy snippet below), but I think this should probably be implemented in OpenCV.
Also, this policy is from OpenCV, right? It says "Default Policy", but can we somehow change it to make it much larger?
// Default Policy
struct DefaultTransposePolicy
{
    enum {
        tile_dim    = 16,
        block_dim_y = 16
    };
};
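For reference, a minimal sketch of the ROI-splitting workaround mentioned above (the chunkedTranspose helper is hypothetical; it assumes cv::cuda::transpose writes correctly into colRange() views, which is untested):

#include <algorithm>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

// Transpose a tall matrix in row chunks that stay at or below the
// 16 * 65535 row limit of cv::cuda::transpose observed above.
void chunkedTranspose(const cv::cuda::GpuMat& src, cv::cuda::GpuMat& dst)
{
    const int maxRows = 16 * 65535; // per-call limit
    dst.create(src.cols, src.rows, src.type());
    for (int r = 0; r < src.rows; r += maxRows)
    {
        const int n = std::min(maxRows, src.rows - r);
        // Rows [r, r + n) of src become columns [r, r + n) of dst.
        cv::cuda::GpuMat srcChunk = src.rowRange(r, r + n);
        cv::cuda::GpuMat dstChunk = dst.colRange(r, r + n);
        cv::cuda::transpose(srcChunk, dstChunk);
    }
}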
> Also, this policy is from OpenCV, right? It says "Default Policy", but can we somehow change it to make it much larger?
Not really, not without affecting the performance of the underlying kernel, and even then it won't make a big difference to the number of rows you can process. It would be better to process more elements per thread, because that significantly increases the number of rows without affecting performance.
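To put rough numbers on that (assuming the usual CUDA limit of 1024 threads per block):

- The block is tile_dim x block_dim_y threads, so tile_dim * block_dim_y <= 1024; with tile_dim = 16 the policy can be raised to at most block_dim_y = 64, giving 64 * 65535 = 4194240 rows.
- Coarsening by a factor c keeps block_dim_y = 16 and handles 16 * c * 65535 rows; c = 64 already gives 16 * 64 * 65535 = 67107840 rows.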
> Hmm, ok. If this is the case, what do you suggest we use for an HWC to CHW permute? I guess we have to implement something ourselves then. We could of course break it down into smaller ROIs and transpose them one by one, or on different streams, but I think this should probably be implemented in OpenCV.
For performance it might be better to implement something custom yourselves, because the sizes you are using don't fit well into the standard block sizes you will encounter (16x16, 32x16, etc.) and the 2D assumptions they rely on. E.g. if your data isn't pitched, you can improve your read coalescing significantly by reading several rows at once, and potentially tailor your algorithm around this.
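As an illustration, a hedged sketch of an HWC-to-CHW permute that sidesteps the 2D grid limits entirely by using a 1D grid-stride loop over the flat buffer (assumes contiguous, unpitched float data):

// HWC -> CHW permute of a contiguous float image.
// A 1D grid-stride loop keeps the launch configuration valid no matter how
// tall the image is, and consecutive threads read consecutive src elements.
__global__ void hwc2chw(const float* src, float* dst, int h, int w, int c)
{
    const size_t total = (size_t)h * w * c;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total;
         i += (size_t)gridDim.x * blockDim.x)
    {
        const size_t ch = i % c;   // channel index in the HWC layout
        const size_t xy = i / c;   // y * w + x
        dst[ch * (size_t)h * w + xy] = src[i];
    }
}

// e.g. hwc2chw<<<256, 256>>>(d_src, d_dst, h, w, c);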
For a quick fix, as I suggested, you could use thread coarsening; see
https://github.com/opencv/opencv_contrib/compare/4.x...cudawarped:opencv_contrib:cuda_transpose_fix
for an example where I have added it to the transpose and setTo operations and removed the calls to NPP. Note the coarsening factor is fixed at 64, but you can increase it further if you need to accommodate more rows.
Then, when NPP fixes its restriction and https://github.com/opencv/opencv_contrib/pull/3371 is merged, you can revert your changes.