P3 - Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN

Open jailuthra opened this issue 3 years ago • 3 comments

Debug CUDA_ERROR_UNKNOWN errors. Why? We should follow up, but this is hard to debug until the P2s are addressed, and the errors seem to have stopped.

Describe the bug

GPU video decoding fails with CUDA_ERROR_UNKNOWN, and the user has to restart the node before future segments can be decoded. Sometimes the error is paired with CUDA_ERROR_OUT_OF_MEMORY or CUDA_ERROR_ILLEGAL_ADDRESS.

To Reproduce

Steps to reproduce the behavior:

  • Unclear as of now.

Expected behavior

Decrease the blast radius of these errors if possible, and figure out the root cause.

Screenshots

  • ERROR_UNKNOWN (image)
  • ERROR_ILLEGAL_ADDRESS (image)
  • ERROR_OUT_OF_MEMORY (image)

Additional context

Stack-trace for future reference:

  • LPMS: https://github.com/livepeer/lpms/blob/master/ffmpeg/decoder.c#L250
  • FFmpeg entry-point: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L610
  • Most-probable line causing the error: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L629
  • CUDA-specific ctx creation routine: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L379
  • cuCtxCreate call: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L363

jailuthra, Jun 15 '21 12:06

Update here:

  • Our current understanding is that the process only needs to be restarted if there is a CUDA_ERROR_ILLEGAL_ADDRESS error (related #356)
  • We observed CUDA_ERROR_UNKNOWN (not paired with an OOM or illegal address error) again recently in prod, but we don't have repro steps right now
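
The restart policy described above could be sketched roughly as follows. This is only an illustration, not code from LPMS: the helper name `needsRestart` is hypothetical, and it assumes the decoder error reaches the caller as a string that contains the CUDA error name.

```go
package main

import (
	"fmt"
	"strings"
)

// needsRestart reports whether a decoder error message calls for a process
// restart. Following the current understanding in this thread, only
// CUDA_ERROR_ILLEGAL_ADDRESS does. Matching on the message text is an
// assumption made for this sketch, and the function name is hypothetical.
func needsRestart(errMsg string) bool {
	return strings.Contains(errMsg, "CUDA_ERROR_ILLEGAL_ADDRESS")
}

func main() {
	// Example messages; the "decode failed:" prefix is made up for illustration.
	msgs := []string{
		"decode failed: CUDA_ERROR_UNKNOWN",
		"decode failed: CUDA_ERROR_OUT_OF_MEMORY",
		"decode failed: CUDA_ERROR_ILLEGAL_ADDRESS",
	}
	for _, msg := range msgs {
		fmt.Printf("%s -> restart=%v\n", msg, needsRestart(msg))
	}
}
```

If that understanding changes (for example, if CUDA_ERROR_UNKNOWN also turns out to leave the process unusable), only the predicate would need to change.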

yondonfu, Nov 30 '22 14:11

Hi,

I'm facing exactly the same issue with livepeer V0.5.35, Nvidia driver 525.78.01, CUDA version 12.0, and power state P0.

It happens on some streams, not all, but I hit this error 28 times in the last 24h.

boratuncer, Jan 08 '23 11:01

Hi @yondonfu,

I have a workaround for this. It's not perfect, but when the error happens I invoke the transcoder service again. So basically there are two transcoders for a moment, and one of them is terminated automatically within 3-4 seconds, but this prevents me from losing streams because of this CUDA error. Please consider this, or a better implementation, for the next releases :)

boratuncer, Jan 13 '23 10:01