MediaSDK icon indicating copy to clipboard operation
MediaSDK copied to clipboard

Hung up in MFXVideoVPP_GetVPPStat after random number of seeks with FFMPEG

Open santomet opened this issue 3 years ago • 22 comments

System information

  • CPU information (link to ark.intel.com): 1035G4 (Surface Pro 7)
  • GPU information (link to ark.intel.com): -||-
  • Driver version: described in text

Issue behavior

Describe the current behavior

The h264_qsv in FFMPEG decoder freezes after random number of seeks with flushes (most of the times you need max 100 seeks). Does not happen when standard h264 or h264_cuvid decoder is used.

Note that this might be a bug not in Intel Media SDK, but in FFMPEG. I am not sure on which grave should I cry, but because there are developers from Intel actively enhancing the QSV support on the FFMPEG, i'll describe the issue both there and here.

Tested with both Windows-provided libmfxhw64.dll dispatcher (File Version 11.21.4.14) and from the latest driver (File Version 21.11.3.320) This is the debug call stack on my decoding thread when the freeze happens:

1 ZwWaitForAlertByThreadId ntdll 0x7fff7a530764
2 RtlSleepConditionVariableSRW ntdll 0x7fff7a4f4021
3 SleepConditionVariableSRW KERNELBASE 0x7fff77ecce89
4 MFXVideoVPP_GetVPPStat libmfxhw64 0x7fff027ef8cc
5 MFXVideoVPP_GetVPPStat libmfxhw64 0x7fff027ec7f4
6 MFXVideoVPP_GetVPPStat libmfxhw64 0x7fff027ec69f
7 MFXVideoVPP_GetVPPStat libmfxhw64 0x7fff023575b0
8 MFXVideoCORE_SyncOperation libmfxhw64 0x7fff02337a4d
9 av_qsv_alloc_context avcodec_59 0x7fff10eb4c06
10 av_qsv_alloc_context avcodec_59 0x7fff10eb573d
11 av_qsv_alloc_context avcodec_59 0x7fff10eb59c7
12 avcodec_default_get_buffer2 avcodec_59 0x7fff10a9f9b1
13 avcodec_send_packet avcodec_59 0x7fff10aa0590
14 libavDecoder::playLoopPrivate libavdecoder.cpp 724 0x7ff78094f6da

EDIT: The decoding always resumes after circa one minute.

I am using FFMPEG builds from BtbN - confirmed with one of the latest master update (2022-02-15 12:35)

Describe the expected behavior

Not freezing?

Code example

This is small excrept with the important parts to showcase how am I using FFMPEG for this:

Seek:

...........................
err = av_seek_frame(m.fmtctx, m.videoStream, [random-pts-in-media minus GOP interval], 0);
        if(err != 0) {
                av_strerror(err, errbuff, 256);
                std::cerr << "Error Seeking. AVERROR: " << errbuff  << std::endl;
        }

        ptsToFind = referencePts;
        seekMode = true;
        seekModeKeyframeFound = false;
        seekModeFlushed = false;

.........................

Read&Decode:

...........................
 //Reads the right amount of data from stream
        err = av_read_frame(m.fmtctx, pPacket);
        if(err < 0) {
            av_strerror(err, errbuff, 256);
            std::cerr << "AV READ FRAME error or EOF " << errbuff << " " << m.lastPts[m.videoStream] << std::endl;
            if(err == AVERROR_EOF) {
                emit endOfFile(m.lastPts[m.videoStream]);
            }
            av_packet_unref(pPacket);
            break;
        }

        //If we are seeking, flush the codec and reset the GOP counter (as we automatically found nearest I frame)
        if(seekMode) {

            if(!seekModeFlushed) {
                if(!(pPacket->flags & AV_PKT_FLAG_KEY)){
                    av_packet_unref(pPacket);
                    continue;
                }
                avcodec_send_packet(avcc, NULL); //flush
                avcodec_flush_buffers(avcc);
                actual_gop_counter = 0;
                seekModeFlushed = true;
            }
        }

        //Send the "right amount of data" to the decoder
        err = avcodec_send_packet(avcc, pPacket);
        if(err != 0) {
            av_strerror(err, errbuff, 256);
            std::cerr << "Error sending packet. AVERROR: " << errbuff << std::endl;
            av_packet_unref(pPacket);
            continue;
        }


        //Gather first ready frame from the internal queue
        err = avcodec_receive_frame(avcc, pFrame);
        if(err != 0) {
            if(err != AVERROR(EAGAIN)) { //resource can be unavailable pretty often, do not write out anything in such case
                av_strerror(err, errbuff, 256);
                std::cerr << "Error receiving frame. AVERROR: " << errbuff  << std::endl;
            }
            av_packet_unref(pPacket);
            av_frame_unref(pFrame);
            continue;
        }

        //Save data for average GOP computation
        if(pFrame->pict_type == AV_PICTURE_TYPE_I && actual_gop_counter != 0) {
            if(actual_gop_counter > max_gop) {
                max_gop = actual_gop_counter;
            }
            actual_gop_counter = 0;
        }
        ++actual_gop_counter;

        if(seekMode) {
            if(pFrame->pts >= ptsToFind && pFrame->pts <= ptsToFind + avg_pts*MAX_ALLOWED_SKIP_FRAMES) {
                seekMode = false;
                //Because QSV is a fucked-up decoder, it always returns exactly TWO frames from the previous
                //decoding chunk. And you can not flush it, it will only return them after you provide
                //two fresh packets. Really fucked-up. Really. Therefore we can not stay with
                //the check that if the pts is already the same or greater that the one we need to find.
                //We also have to introduce if it is not too far away. That is defined in MAX_ALLOWED_SKIP_FRAMES
            }
            else {
                av_frame_unref(pFrame);
                av_packet_unref(pPacket);
                continue; //do not proceed if still in seek
            }
        }

.............................

##Further comments Is there any way I can help more with hunting down this issue?

santomet avatar Feb 21 '22 15:02 santomet

Most likely this call stack shows wrong symbols so MFXVideoVPP_GetVPPStat () is a misleading function. Is there any way to share a reproducer with us?

dmitryermilov avatar Feb 21 '22 16:02 dmitryermilov

Two possibilities: I can either share compiled application privately with you (incl. instructions and video file) if the source code is not needed, or I can create shareable reproducer but it will take me a little. Would the first one be of some use for you?

santomet avatar Feb 21 '22 17:02 santomet

Hi @santomet , Private version is fine.

dmitryermilov avatar Feb 21 '22 20:02 dmitryermilov

Nice, can I use your work email?

santomet avatar Feb 22 '22 07:02 santomet

BTW, I'd suggest in parallel to consider another debug activity. As you know, ffmpeg QSV plug-in operates with MediaSDL. MediaSDK dev package has the API tracer (with per-frame logging). Can you please try to reproduce the issue under the tracer and share log file with me ? Please feel free to decide to use public or private channel. image

dmitryermilov avatar Mar 01 '22 14:03 dmitryermilov

Thanks for this! honestly, i didn't notice as at the time I tried to solve this, it was impossible to download the SDK due to some strange error. However, now i have it and I present two files:

tracer.log is a full situation including freeze and later the recovery.

tracer-duringfreeze.log is a copy from the time when the decoding has been frozen.

santomet avatar Mar 01 '22 18:03 santomet

Thank you. I guess you didn't click Per-frame logging because I don't see any DecodeFrameAsync calls (there's only DecodeHeader/Init/Close calls). Can you please double-check?

But even without it (but still full logs are strongly needed to see full picture) I see lots of DecodeHeader calls which return MFX_ERR_MORE_DATA. It means that decoder can't find SPS/PPS in provided bitstream.

My guess (and full logs would confirm it or not) is that when you make repositioning across the video file, ffmpeg calls either attempts to Close/Init decoder (and to init decoder we need to provide latest params obtained from DecodeHeader()) or ffmpeg "just" calls DecodeHeader expecting MFX_ERR_NONE results. I don't want to speculate more right now. Need some clarify to move forward. @xhaihao , can you please explain how ffmpeg handles seeking in terms of msdk api calls?

dmitryermilov avatar Mar 01 '22 21:03 dmitryermilov

Thanks!, Sorry, now I have replicated the same situation with per-frame logging (again, the "during" file is copied during the freeze) SDK_Tracer.zip

santomet avatar Mar 02 '22 08:03 santomet

Thank you. Last "good" frame :

17520 2022-3-2 9:23:10:293 function: MFXVideoDECODE_DecodeFrameAsync(0.2735 msec, status=MFX_ERR_NONE) - 


17520 2022-3-2 9:23:10:293 function: MFXVideoCORE_SyncOperation(mfxSession session=00000214D77AF830, mfxSyncPoint syncp=00000000000ECC00, mfxU32 wait=1000) +
17520 2022-3-2 9:23:10:293     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:23:10:293     mfxSyncPoint* syncp=00000000000ECC00
17520 2022-3-2 9:23:10:293     mfxU32 wait=1000
17520 2022-3-2 9:23:10:293 >> MFXVideoCORE_SyncOperation called
17520 2022-3-2 9:23:10:293     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:23:10:293     mfxSyncPoint* syncp=00000000000ECC00
17520 2022-3-2 9:23:10:293     mfxU32 wait=1000
17520 2022-3-2 9:23:10:293 function: MFXVideoCORE_SyncOperation(0.0137 msec, status=MFX_ERR_NONE) -

then huge number of logs like

17520 2022-3-2 9:23:11:303 function: MFXVideoCORE_SyncOperation(1000.6 msec, status=MFX_WRN_IN_EXECUTION) - 


17520 2022-3-2 9:23:11:303 function: MFXVideoCORE_SyncOperation(mfxSession session=00000214D77AF830, mfxSyncPoint syncp=00000000000ED000, mfxU32 wait=1000) +
17520 2022-3-2 9:23:11:303     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:23:11:303     mfxSyncPoint* syncp=00000000000ED000
17520 2022-3-2 9:23:11:303     mfxU32 wait=1000
17520 2022-3-2 9:23:12:305 >> MFXVideoCORE_SyncOperation called
17520 2022-3-2 9:23:12:305     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:23:12:305     mfxSyncPoint* syncp=00000000000ED000
17520 2022-3-2 9:23:12:305     mfxU32 wait=1000
17520 2022-3-2 9:23:12:305 function: MFXVideoCORE_SyncOperation(1001.21 msec, status=MFX_WRN_IN_EXECUTION) - 

SyncOperation() API function took 1 second! MFX_WRN_IN_EXECUTION means frame still in execution somewhere in underneath stack. Regular timing for SyncOperation is a couple of milliseconds. Such behavior is observed until:

17520 2022-3-2 9:24:9:402 function: MFXVideoCORE_SyncOperation(1000.02 msec, status=MFX_WRN_IN_EXECUTION) - 


17520 2022-3-2 9:24:9:402 function: MFXVideoCORE_SyncOperation(mfxSession session=00000214D77AF830, mfxSyncPoint syncp=00000000000ED000, mfxU32 wait=1000) +
17520 2022-3-2 9:24:9:402     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:24:9:402     mfxSyncPoint* syncp=00000000000ED000
17520 2022-3-2 9:24:9:402     mfxU32 wait=1000
17520 2022-3-2 9:24:10:272 >> MFXVideoCORE_SyncOperation called
17520 2022-3-2 9:24:10:272     mfxSession session=00000214D7D0E890
17520 2022-3-2 9:24:10:273     mfxSyncPoint* syncp=00000000000ED000
17520 2022-3-2 9:24:10:273     mfxU32 wait=1000
17520 2022-3-2 9:24:10:273 function: MFXVideoCORE_SyncOperation(869.569 msec, status=MFX_ERR_NONE) - 

Timestamps indicate that the hang was during 1 minute which BTW is pretty interesting observation. Not clear why pipeline go further after 1 second. It's more or less clear now what's happening from API MSDK point of view. But of course real root cause is not clear. Most likely it's a kind of synchronization issue between MSDK and driver or theoretically a GPU hang which somehow wasn't reported to application properly (even it's reported, GPU hang is a hard stop for the pipeline; if so, usually need to recreate decoder instance). @santomet, I believe we will be able to go deeper only if we have a local reproducer. Previously you mentioned you can prepare a simplified version of your app. Can you please do it? Even if it takes time.

dmitryermilov avatar Mar 02 '22 13:03 dmitryermilov

@wangyan-intel , FYI. This issue reminds me MDP-68429, MDP-66102

dmitryermilov avatar Mar 02 '22 14:03 dmitryermilov

Thanks so far, I'll try to create the custom reproducer this week.

santomet avatar Mar 02 '22 14:03 santomet

@santomet , I think we fixed a similar issue at the end of December. Driver 101.1191 was build beginning of December so this driver doesn't have the fix I'm talking about. But very latest 101.1340 is from beginning of February, 101.1340 has the fix. Can you please try with latest driver? It's blind guess of course but There's a high chance the issue is already solved. If not, let's come back to the idea with reproducer.

dmitryermilov avatar Mar 02 '22 15:03 dmitryermilov

Hi @santomet , Any news?

dmitryermilov avatar Mar 04 '22 10:03 dmitryermilov

Hi @dmitryermilov, Sorry for this, I have been too sick since yesterday, I'll try to do it today.

santomet avatar Mar 04 '22 10:03 santomet

Sorry to hear it. Please get well!

dmitryermilov avatar Mar 04 '22 21:03 dmitryermilov

@xhaihao , can you please explain how ffmpeg handles seeking in terms of msdk api calls?

Usually user should call avcodec_flush_buffers() when seeking a stream, then qsv decoder clear buffers and re-initialize the context for decoding. FFmpeg calls MFXVideoDECODE_DecodeHeader() to get the stream parameters from the SDK.

xhaihao avatar Mar 06 '22 03:03 xhaihao

I'm almost OK now, thanks for the patience. I tried with 1340 and it also happens - but i couldn't catch it with tracer so far. I will be trying with 1404 from now on

santomet avatar Mar 10 '22 13:03 santomet

Got it, @santomet . Looking forward to the simplified reproducer.

dmitryermilov avatar Mar 11 '22 10:03 dmitryermilov

Thanks, @dmitryermilov, so far, with 1404 I haven't been able to catch it. Just for curiosity, was there any related change from 1340 to 1404?

santomet avatar Mar 14 '22 08:03 santomet

Unfortunately no, I can't find commits in MSDK and media-driver landed in the range [newer than 1340 and older than 1404] which potentially could fix the issue. Most likely it's just a low reproduction rate. However it's possible the change was on KMD level but I'm not an expert there. If even a very small probability of the issue is important for you, I'd still treat the issue as not fixed.

dmitryermilov avatar Mar 14 '22 20:03 dmitryermilov

If the reproduction rate is this low with 1404 driver (I have been trying to catch it for many hours now without any success) it might not be such an issue. I will start updating the drivers on deployed devices today.

I will keep the issue open for now. If we won't be able to encounter it again for several weeks, I'll close it. Thanks @dmitryermilov for the help and patience

santomet avatar Mar 15 '22 09:03 santomet

Always welcome, @santomet !

dmitryermilov avatar Mar 15 '22 15:03 dmitryermilov