How to evaluate the GPU memory usage for video decoding with CUDA?
- OpenCV => 4.7.0
- Operating System / Platform => Linux
Detailed description
I built opencv_contrib_python with cuvid and use it to pull an RTSPS stream:
2560x1920 [SAR 1:1 DAR 4:3], 20 fps, 20 tbr, 90k tbn, 40 tbc
I find that it costs 391 MB of GPU memory.
@cudawarped How can I evaluate the GPU memory usage for video decoding with CUDA?
I am not 100% sure what you are asking, because your question implies that you have already evaluated the amount of device memory required for your video decoding session. That said, I will assume you are asking why so much memory is used and what it is used for.
The 391 MB value you have mentioned is composed of both the memory needed to create the CUDA context and the memory needed to create an instance of cv::cudacodec::VideoReader, with the former most likely accounting for the largest proportion. In Python I don't know of a way to determine the amount of memory used to create a CUDA context, but I assume there are many libraries out there which can perform this task. In C++ you would use cudaMemGetInfo() and not cv::cuda::DeviceInfo::freeMemory(), as the latter creates a CUDA context before making the calculation.
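For reference, here is a minimal C++ sketch (my own rough sketch, not the example linked below) of how you could estimate the VideoReader's share with cudaMemGetInfo(). The context's share has to be observed from outside the process, e.g. with nvidia-smi, because any CUDA runtime call already creates the context; the stream URL below is a placeholder.

```cpp
#include <cuda_runtime.h>
#include <opencv2/cudacodec.hpp>
#include <iostream>

static size_t freeDeviceMem() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    return freeBytes;
}

int main() {
    cudaFree(0);  // force creation of the primary CUDA context before measuring
    const size_t freeBefore = freeDeviceMem();

    // "rtsp://..." is a placeholder for your 2560x1920 stream
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader("rtsp://...");
    cv::cuda::GpuMat frame;
    reader->nextFrame(frame);  // decode surfaces are allocated once the first sequence is parsed

    const size_t freeAfter = freeDeviceMem();
    // Note: the difference also includes the returned frame's own allocation.
    std::cout << "VideoReader uses ~"
              << (freeBefore - freeAfter) / (1024.0 * 1024.0) << " MB" << std::endl;
    return 0;
}
```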
The second much smaller portion of the memory is mainly determined by the number of internal decode surfaces required to decode the video source which will depend on the codec and parameters used to encode the source.
If you're interested, I have written a small example demonstrating this, which can be found here.
@cudawarped Wow, thank you very much, this is amazing. I am really interested in this.
As you said, "The minimum number of decode surfaces is determined by the video source and can be increased to increase decoding performance."
I have some questions:
- How do I get the number of decode surfaces I can use? I found the default is 5 on my machine but 4 in your example.
- What is the right way to allocate decode surfaces via minNumDecodeSurfaces? Grab a frame to get the format info, then double it?
- What is the maximum number of decode surfaces?
- If I have two video streams to pull, can they use the same CUDA context? For now I need to create two VideoReaders.
- Also, I use `ffmpeg -hwaccel_output_format cuda -y -f rawvideo -pix_fmt bgr24 -s 2560x1920 -i - -c:v h264_nvenc -preset ll -pix_fmt yuv420p -f rtsp -rtsp_transport tcp rtsps://xxxxxxx` to push the frames I have processed, so how do I increase the number of encode surfaces to increase encoding performance from the ffmpeg command?
- How do I get the number of decode surfaces I can use? I found the default is 5 on my machine but 4 in your example.
The minimum number of decode surfaces is determined internally by the Nvidia Video Codec SDK. Decoding performance can be increased by increasing this number; however, I would not expect any increase in performance when streaming from RTSP, as you will not be saturating the hardware decoding unit. Where you will notice a difference is when decoding a high-resolution video file as quickly as possible: with the default you may only get 60% decoder utilization, which can be increased to 100% when enough surfaces are chosen.
If you are interested you can read the documentation. I have pasted two of the most relevant parts below.
ulNumDecodeSurfaces: Referred to as decode surfaces elsewhere in this document, this is the number of surfaces that the driver will internally allocate for storing the decoded frames. Using a higher number ensures better pipelining but increases GPU memory consumption. For correct operation, minimum value is defined in CUVIDEOFORMAT::min_num_decode_surfaces and can be obtained from first sequence callback from Nvidia parser. The NVDEC engine writes decoded data to one of these surfaces. These surfaces are not accessible by the user of NVDECODE API, but the mapping stage (which includes decoder output format conversion, scaling, cropping etc.) uses these surfaces as input surfaces.
This will ensure that the underlying driver allocates minimum number of decode surfaces to correctly decode the sequence. In case there is reduction in decoder performance, clients can slightly increase CUVIDDECODECREATEINFO::ulNumDecodeSurfaces. It is therefore recommended to choose the optimal value of CUVIDDECODECREATEINFO::ulNumDecodeSurfaces to ensure right balance between decoder throughput and memory consumption.
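If it helps, here is a rough C++ sketch of how that number can be raised through cv::cudacodec, assuming the VideoReaderInitParams interface available in OpenCV 4.7; the value is only a lower bound, and the parser still enforces the codec's own minimum (CUVIDEOFORMAT::min_num_decode_surfaces). The file name is a placeholder.

```cpp
#include <opencv2/cudacodec.hpp>

int main() {
    // Request more decode surfaces than the default minimum chosen by the SDK.
    cv::cudacodec::VideoReaderInitParams params;
    params.minNumDecodeSurfaces = 8;

    // "input.mp4" is a placeholder; the empty vector is the FFmpeg source parameters.
    cv::Ptr<cv::cudacodec::VideoReader> reader =
        cv::cudacodec::createVideoReader("input.mp4", {}, params);

    cv::cuda::GpuMat frame;
    while (reader->nextFrame(frame)) {
        // process the frame ...
    }
    return 0;
}
```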
- If I have two video streams to pull, can they use the same CUDA context? For now I need to create two VideoReaders.
Yes.
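For example (a minimal sketch with placeholder URLs, assuming both readers are created on the same GPU and therefore share its primary CUDA context):

```cpp
#include <opencv2/cudacodec.hpp>

int main() {
    // Both readers live in the same primary CUDA context, so the context
    // memory is paid once; each reader still allocates its own decode surfaces.
    cv::Ptr<cv::cudacodec::VideoReader> reader1 =
        cv::cudacodec::createVideoReader("rtsp://camera1/stream");
    cv::Ptr<cv::cudacodec::VideoReader> reader2 =
        cv::cudacodec::createVideoReader("rtsp://camera2/stream");

    cv::cuda::GpuMat frame1, frame2;
    while (reader1->nextFrame(frame1) && reader2->nextFrame(frame2)) {
        // process both streams ...
    }
    return 0;
}
```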
- Also, I use `ffmpeg -hwaccel_output_format cuda -y -f rawvideo -pix_fmt bgr24 -s 2560x1920 -i - -c:v h264_nvenc -preset ll -pix_fmt yuv420p -f rtsp -rtsp_transport tcp rtsps://xxxxxxx` to push the frames I have processed, so how do I increase the number of encode surfaces to increase encoding performance from the ffmpeg command?
I am not sure that you can.
Thank you again.
Another question:
My OpenCV and FFmpeg are built with Video_Codec_SDK_11.1.5 and https://github.com/FFmpeg/nv-codec-headers/tree/n11.1.5.1.
When I use `ffmpeg -hwaccel_output_format cuda -y -f rawvideo -pix_fmt bgr24 -s 2560x1920 -i - -c:v h264_nvenc -preset p6 -pix_fmt yuv420p -f rtsp -rtsp_transport tcp rtsps://xxxxxxx`
I get:
[h264_nvenc @ 0x5580726ffe40] [Eval @ 0x7ffddc517ff0] Undefined constant or missing '(' in 'p6'
[h264_nvenc @ 0x5580726ffe40] Unable to parse option value "p6"
[h264_nvenc @ 0x5580726ffe40] Error setting option preset to value p6.
Why? And how do I get the right value for preset?
I don't know; additionally, this question is off topic and unrelated to OpenCV.