Unhandled Hardware Encoder/Decoder errors should cause OME to exit

Open SceneCityDev opened this issue 10 months ago • 5 comments

It looks like the Intel Quick Sync silicon and/or the Intel GPU drivers have some bugs: sometimes (after a couple of days) the GPU gets stuck.

When that happens, this log call in decoder_avc_qsv.cpp is triggered:

logte("An error occurred while sending a packet for decoding: Unhandled error (%d:%s) ", ret, err_msg);

However, once this happens, it happens forever, so you get an endless loop of this unhandled error. The only way to recover is to restart OME.

So, in reality, this is a fatal error.

I "fixed" this simply by adding an exit(1) line after that. This way systemd will handle re-starting OME.

The current behavior is bad: a fatal error is completely ignored, and there is no way to monitor it, because the monitoring API claims that all is fine. IMHO, unhandled errors in an encoder/decoder should at least cause kill_flag to be set or, even safer, cause OME to terminate.
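To make the proposal concrete, here is a minimal sketch of one way the "Unhandled error" branch could escalate instead of looping forever. It assumes a hypothetical helper class, `DecoderErrorWatchdog`, and a threshold of consecutive failures. Neither exists in OME, and the `_kill_flag` member only stands in for whatever kill_flag the real decoder class uses.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, NOT part of OvenMediaEngine. It counts consecutive
// unhandled decoder errors and latches a kill flag once a threshold is hit.
class DecoderErrorWatchdog
{
public:
    explicit DecoderErrorWatchdog(uint32_t max_consecutive_errors)
        : _max_consecutive_errors(max_consecutive_errors) {}

    // Call after a packet was sent to the decoder successfully.
    // Resets the error streak, but never clears a latched kill flag.
    void OnSuccess() { _consecutive_errors.store(0); }

    // Call from the "Unhandled error" branch.
    // Returns true once the kill flag has been set.
    bool OnUnhandledError()
    {
        const uint32_t count = _consecutive_errors.fetch_add(1) + 1;
        if (count >= _max_consecutive_errors)
        {
            _kill_flag.store(true);
        }
        return _kill_flag.load();
    }

    // Polled by the owner (assumed here: the decoder thread or main loop).
    bool IsKillRequested() const { return _kill_flag.load(); }

private:
    const uint32_t _max_consecutive_errors;
    std::atomic<uint32_t> _consecutive_errors{0};
    std::atomic<bool> _kill_flag{false};
};
```

The decoder loop could call `OnUnhandledError()` right after the logte line and stop when it returns true. If the process then exits with a non-zero status, a systemd unit configured with `Restart=on-failure` would restart it, which is the same effect as the exit(1) workaround but without aborting on a single transient error.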

SceneCityDev avatar Apr 03 '24 20:04 SceneCityDev

Thank you for reporting the issue. In these cases, it is recommended to regenerate the encoder and decoder, or, if that is not enough, to regenerate the stream. However, we have not yet implemented any handling for hardware encoder crashes, because we believed that Nvidia and Xilinx hardware would not crash.
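The recovery order described here can be sketched as a small escalation ladder. This is only an illustration: `RecoveryAction` and `NextRecoveryAction` are hypothetical names, not OME code, and the final "terminate" step is the reporter's suggestion from this thread rather than existing behavior.

```cpp
// Hypothetical escalation ladder, NOT part of OvenMediaEngine.
enum class RecoveryAction
{
    RecreateCodec,    // regenerate the encoder/decoder
    RecreateStream,   // if that is not enough, regenerate the stream
    TerminateProcess  // last resort: exit so a supervisor can restart OME
};

// Maps the number of recovery attempts that have already failed
// to the next action to try.
RecoveryAction NextRecoveryAction(unsigned failed_recoveries)
{
    switch (failed_recoveries)
    {
        case 0:
            return RecoveryAction::RecreateCodec;
        case 1:
            return RecoveryAction::RecreateStream;
        default:
            return RecoveryAction::TerminateProcess;
    }
}
```

Keeping the policy in one pure function would make it easy to test and to change the order later without touching the codec code.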

I think improving this will take a long time.

getroot avatar Apr 04 '24 06:04 getroot

We used to use NVIDIA Tesla P4s for encoding, and they would crash almost daily; no amount of driver updates ever fixed it.

irlkitcom avatar Apr 05 '24 15:04 irlkitcom

Would an OME restart then work, or would the NVidia drivers be permanently broken until a reboot is done?

If restarting OME also works for you, I'd vote to try adding kill_flag or, if that is not enough, exit(1) after the error messages.

Do you have logs of the moment this is happening with NVidia? Is it also "Unhandled error"?

SceneCityDev avatar Apr 05 '24 18:04 SceneCityDev

Sorry, I should have stated that this wasn't with OME. These were two custom systems, which ran on Proxmox first and then on Windows, and both had issues. On Linux, you could recover without a reboot, but on Windows it often caused a Blue Screen. It's been a while, and I don't have the hardware or software anymore, so I cannot test.

irlkitcom avatar Apr 06 '24 00:04 irlkitcom

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 05 '24 03:06 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 04 '24 18:08 stale[bot]