lpms
P3 - Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN
Task: debug CUDA_ERROR_UNKNOWN errors. Why? Should follow up, but this is hard to debug until the P2s are addressed, and the errors seem to have stopped.
Describe the bug
The GPU video decoding fails with CUDA_ERROR_UNKNOWN, requiring the user to restart the node for future segments. Sometimes it's paired with CUDA_ERROR_OUT_OF_MEMORY or CUDA_ERROR_ILLEGAL_ADDRESS.
To Reproduce
Steps to reproduce the behavior:
- Unclear as of now.
Expected behavior
Decrease the blast radius of these errors if possible, and figure out the root cause.
Screenshots
ERROR_UNKNOWN
ERROR_ILLEGAL_ADDRESS
ERROR_OUT_OF_MEMORY
Additional context
Stack trace for future reference:
- LPMS: https://github.com/livepeer/lpms/blob/master/ffmpeg/decoder.c#L250
- FFmpeg entry point: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L610
- Most probable line causing the error: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L629
- CUDA-specific context creation routine: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L379
- cuCtxCreate call: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L363
Update here:
- Our current understanding is that the process only needs to be restarted if there is a CUDA_ERROR_ILLEGAL_ADDRESS error (related #356)
- We observed CUDA_ERROR_UNKNOWN (not paired with an OOM or illegal address error) again recently in prod, but we don't have repro steps right now
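To make the restart rule in the update above concrete, here is a minimal Go sketch of how a caller might decide whether a failed segment calls for a process restart. It is illustrative only: needsRestart and isFatalCUDAMessage are hypothetical names, not lpms APIs, and the sketch assumes the CUDA error name appears verbatim in the error message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isFatalCUDAMessage reports whether an error message names the one CUDA
// error that, per the current understanding, requires a process restart.
// Assumption: the CUDA error name appears verbatim in the message.
func isFatalCUDAMessage(msg string) bool {
	return strings.Contains(msg, "CUDA_ERROR_ILLEGAL_ADDRESS")
}

// needsRestart is a hypothetical helper (not an lpms API) that applies the
// rule to a Go error value. A nil error never requires a restart.
func needsRestart(err error) bool {
	return err != nil && isFatalCUDAMessage(err.Error())
}

func main() {
	// Made-up example messages covering the three errors seen in this issue.
	examples := []error{
		errors.New("decoder: CUDA_ERROR_UNKNOWN"),
		errors.New("decoder: CUDA_ERROR_OUT_OF_MEMORY"),
		errors.New("decoder: CUDA_ERROR_ILLEGAL_ADDRESS"),
	}
	for _, e := range examples {
		fmt.Printf("%v -> restart=%v\n", e, needsRestart(e))
	}
}
```

Under this rule a standalone CUDA_ERROR_UNKNOWN would only fail the current segment, which keeps the blast radius small while the root cause is still unknown.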
Hi,
I'm facing exactly the same issue with livepeer V0.5.35, Nvidia driver 525.78.01, CUDA version 12.0, and power state P0.
It happens on some streams, not all, but I hit this error 28 times in the last 24 hours.
Hi @yondonfu,
I have a solution for this. It is not a perfect one, but when the error happens I invoke the transcoder service again. So basically there are two transcoders for a moment, and one of them is terminated automatically within 3-4 seconds. This prevents me from losing streams because of this CUDA error. Please consider this, or a better implementation, for future releases :)
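A rough Go sketch of the restart idea described above, for anyone who wants to experiment with it. It is a simplification under stated assumptions: supervise, runFunc, and failNTimes are hypothetical names, the transcoder is represented by a plain function instead of a real process, and restarts happen one after another instead of briefly overlapping two transcoders as in the workaround.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// runFunc stands in for "run the transcoder until it exits" (hypothetical).
type runFunc func() error

// supervise reruns the transcoder whenever it exits with a CUDA error,
// up to maxRestarts times. It returns the number of restarts performed
// and the last error seen (nil on a clean exit).
// Assumption: CUDA failures carry "CUDA_ERROR" in the error message.
func supervise(run runFunc, maxRestarts int) (int, error) {
	restarts := 0
	for {
		err := run()
		if err == nil || !strings.Contains(err.Error(), "CUDA_ERROR") {
			return restarts, err // clean exit or unrelated failure
		}
		if restarts == maxRestarts {
			return restarts, err // give up instead of looping forever
		}
		restarts++
	}
}

// failNTimes builds a fake transcoder that fails n times with msg and then
// exits cleanly. It exists only to exercise supervise in this sketch.
func failNTimes(n int, msg string) runFunc {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New(msg)
		}
		return nil
	}
}

func main() {
	n, err := supervise(failNTimes(2, "decode failed: CUDA_ERROR_UNKNOWN"), 5)
	fmt.Println(n, err) // prints "2 <nil>"
}
```

The restart cap is a design choice in this sketch so that a GPU stuck in a bad state does not cause an endless restart loop.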