cv::cuda::transpose crashes when CV_32F rows exceed 1048560
System Information
OpenCV: 4.10.0
Compiler: clang 17.0.6
Platform: AlmaLinux 9
CUDA SDK: 12.3
Detailed description
Exception message: OpenCV(4.10.0) opencv-4.10.0/contrib/modules/cudev/include/opencv2/cudev/grid/detail/transpose.hpp:118: error: (-217:Gpu API call) invalid configuration argument in function 'transpose'
Steps to reproduce
#include <iostream>

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

int main()
{
    for (int i = 1048555; i <= 1048561; ++i)
    {
        cv::Mat src(i, 4, CV_32F, cv::Scalar{2});
        cv::Mat dst;
        cv::RNG rng{};
        rng.fill(src, cv::RNG::UNIFORM, 0, 200);

        // CPU works
        cv::transpose(src, dst);

        // GPU: throws once rows exceed 1048560
        cv::cuda::GpuMat d_src(src);
        cv::cuda::GpuMat d_dst;
        cv::cuda::GpuMat d_dst2{src.cols, src.rows, src.type(), cv::Scalar{10}};
        std::cout << cv::cuda::sum(d_src) << " | " << cv::cuda::sum(d_dst2) << std::endl;
        cv::cuda::transpose(d_src, d_dst);
        cv::cuda::transpose(d_src, d_dst2);
        std::cout << cv::cuda::sum(d_src) << " | " << cv::cuda::sum(d_dst)
                  << " | " << cv::cuda::sum(d_dst2) << std::endl;

        // Check results
        bool passed  = cv::norm(dst - cv::Mat(d_dst),  cv::NORM_INF) < 1e-3;
        bool passed2 = cv::norm(dst - cv::Mat(d_dst2), cv::NORM_INF) < 1e-3;
        std::cout << "i=" << i << " dst without memory initialized: " << (passed  ? "passed" : "FAILED") << std::endl;
        std::cout << "i=" << i << " dst with memory initialized: "    << (passed2 ? "passed" : "FAILED") << std::endl;

        // Deallocate data here, otherwise deallocation will be performed
        // after the context is extracted from the stack
        d_src.release();
        d_dst.release();
        d_dst2.release();
        std::cout << "released\n";
    }
    return 0;
}
Issue submission checklist
- [X] I report the issue, it's not a question
- [X] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- [X] I updated to the latest OpenCV version and the issue is still there
- [X] There is reproducer code and related data files (videos, images, onnx, etc)
Your error is unrelated to cv::cuda::transpose and is caused by cv::cuda::GpuMat::setTo. That said, if you don't initialize d_dst2 to 10, i.e.
cv::cuda::GpuMat d_dst2{src.cols, src.rows, src.type()};
then you would receive the same error from the transpose operation.
Both errors are caused by the same underlying issue: the CUDA kernels are launched with a y grid dimension greater than the maximum allowed by the hardware, 65535.
setTo has a y block dimension of 8, meaning it fails with more than 8 * 65535 = 524280 rows, and transpose has a y block dimension of 16, meaning it fails with more than 16 * 65535 = 1048560 rows.
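For reference, the limit can be confirmed on your device with the standard CUDA runtime query (a standalone sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // maxGridSize[1] is the maximum y grid dimension (65535 on current hardware)
    std::printf("max grid y: %d\n", prop.maxGridSize[1]);
    return 0;
}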
A possible solution here would be to use thread coarsening inside the corresponding cudev kernels to reduce the number of blocks launched in the y direction.
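To illustrate, this is not the cudev code, just a minimal sketch of thread coarsening for an element-wise kernel such as setTo (coarsen_y is a made-up parameter):

// Each thread writes coarsen_y consecutive rows instead of one, so the
// y grid dimension shrinks by the same factor and stays under 65535.
template <int coarsen_y>
__global__ void setToCoarsened(float* data, size_t step, int rows, int cols, float val)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = (blockIdx.y * blockDim.y + threadIdx.y) * coarsen_y;
    if (x >= cols)
        return;

    for (int i = 0; i < coarsen_y && y < rows; ++i, ++y)
    {
        float* row = (float*)((char*)data + (size_t)y * step);
        row[x] = val;
    }
}

// Launch with the y grid shrunk by the coarsening factor:
//   dim3 block(32, 8);
//   dim3 grid((cols + block.x - 1) / block.x,
//             (rows + block.y * coarsen_y - 1) / (block.y * coarsen_y));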
Hmm, ok. If this is the case, what do you suggest we use for an HWC to CHW permute? I guess we have to implement something ourselves then. We could of course break it down into smaller ROIs and transpose them one by one, or on different streams (see the sketch after the policy snippet below), but I think this should probably be implemented in OpenCV.
Also, this policy is from OpenCV, right? It says "Default Policy", but can we somehow change it to make it much larger?
// Default Policy
struct DefaultTransposePolicy
{
    enum {
        tile_dim    = 16,
        block_dim_y = 16
    };
};
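For reference, a minimal sketch of the ROI-splitting workaround mentioned above (the chunkedTranspose helper is hypothetical; it assumes cv::cuda::transpose writes correctly into colRange() views, which is untested):

#include <algorithm>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

// Transpose a tall matrix in row chunks that stay at or below the
// 16 * 65535 row limit of cv::cuda::transpose observed above.
void chunkedTranspose(const cv::cuda::GpuMat& src, cv::cuda::GpuMat& dst)
{
    const int maxRows = 16 * 65535; // per-call limit
    dst.create(src.cols, src.rows, src.type());
    for (int r = 0; r < src.rows; r += maxRows)
    {
        const int n = std::min(maxRows, src.rows - r);
        // Rows [r, r + n) of src become columns [r, r + n) of dst.
        cv::cuda::GpuMat srcChunk = src.rowRange(r, r + n);
        cv::cuda::GpuMat dstChunk = dst.colRange(r, r + n);
        cv::cuda::transpose(srcChunk, dstChunk);
    }
}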
> Also, this policy is from OpenCV, right? It says "Default Policy", but can we somehow change it to make it much larger?
Not really, not without affecting the performance of the underlying kernel, and even then it won't make a big difference to the number of rows you can process. It would be better to process more elements per thread, because that significantly increases the number of rows without affecting performance.
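To put rough numbers on that (assuming the usual CUDA limit of 1024 threads per block):

- The block is tile_dim x block_dim_y threads, so tile_dim * block_dim_y <= 1024; with tile_dim = 16 the policy can be raised to at most block_dim_y = 64, giving 64 * 65535 = 4194240 rows.
- Coarsening by a factor c keeps block_dim_y = 16 and handles 16 * c * 65535 rows; c = 64 already gives 16 * 64 * 65535 = 67107840 rows.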
> Hmm, ok. If this is the case, what do you suggest we use for an HWC to CHW permute? I guess we have to implement something ourselves then. We could of course break it down into smaller ROIs and transpose them one by one, or on different streams, but I think this should probably be implemented in OpenCV.
For performance it might be better to implement something custom yourselves, because the sizes you are using don't fit well into the standard block sizes you will encounter (16x16, 32x16, etc.) and the 2D assumptions they rely on. E.g. if your data isn't pitched, you can improve your read coalescing significantly by reading several rows at once, and potentially tailor your algorithm around this.
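As an illustration, a hedged sketch of an HWC-to-CHW permute that sidesteps the 2D grid limits entirely by using a 1D grid-stride loop over the flat buffer (assumes contiguous, unpitched float data):

// HWC -> CHW permute of a contiguous float image.
// A 1D grid-stride loop keeps the launch configuration valid no matter how
// tall the image is, and consecutive threads read consecutive src elements.
__global__ void hwc2chw(const float* src, float* dst, int h, int w, int c)
{
    const size_t total = (size_t)h * w * c;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total;
         i += (size_t)gridDim.x * blockDim.x)
    {
        const size_t ch = i % c;   // channel index in the HWC layout
        const size_t xy = i / c;   // y * w + x
        dst[ch * (size_t)h * w + xy] = src[i];
    }
}

// e.g. hwc2chw<<<256, 256>>>(d_src, d_dst, h, w, c);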
For a quick fix, as I suggested, you could use thread coarsening; see
https://github.com/opencv/opencv_contrib/compare/4.x...cudawarped:opencv_contrib:cuda_transpose_fix
for an example where I have added it to the transpose and setTo operations and removed the calls to NPP. Note the coarsening factor is fixed at 64, but you can increase it further if you need to accommodate more rows.
Then, when NPP fixes its restriction and https://github.com/opencv/opencv_contrib/pull/3371 is merged, you can revert your changes.