Stuck infinitely polling H.264 encoder device
I believe this is the root cause of https://github.com/pikvm/pikvm/issues/1217, or related to it.
In that issue, the PiKVM stream breaks when the display attached to the video capture device goes to sleep (for example, when the remote computer sleeps). The ustreamer process needs to be restarted to resume streaming.
When this happens, the stream thread is calling _H264_PUT() which calls _m2m_encoder_compress_raw(). This function appears to try to give data to the encoder device, then polls it for a response. However, in the bug condition, the encoder device never provides a response.
H264: Compressing new frame; force_key=0 ...
H264: Using INPUT-DMA buffer=0
H264: Sending INPUT-DMA buffer ...
H264: Polling encoder ...
H264: Polling encoder ...
H264: Polling encoder ...
https://github.com/pikvm/ustreamer/blob/2dddb879bc715805ae9e733f7bdcc78a6a23b123/src/ustreamer/m2m.c#L444-L453
The loop repeats forever until there is a POLLIN event. (There is no checking for other types of errors, but with a debugger I confirmed the revents value is 0x0000, not an error condition.)
Anyway, this infinite polling loop has no overall timeout. Thus, if for some reason the encoder doesn't give us anything back, we get stuck here forever.
Some suggestions:
-
[ ] Add an overall timeout for the cumulative time spent polling, and
goto error. This would kick the error back tous_h264_stream_process()but I'm not sure what the error handling is (e.g., does it reset the encoder device).Worth considering if hitting a 1000ms timeout even once from the encoder is already strong evidence that it's not working, considering how quickly we need the encoded frame.
-
[ ] Add some temporary tracing to see the last things sent to the encoder so we can create a reproducible test case about why it doesn't respond.
-
[ ] Move the H.264 encoder out of the
streamthread to a dedicated thread, so if it dies, at least the JPEG worker threads can continue to run. -
[ ] Handle events returned from
poll()other thanPOLLIN.
# ustreamer --version
5.48
Wow. Thanks for the investigation. I believe that the reason lies in the fact that the H264 buffer is transmitted via DMA directly from HDMI capture. To confirm that this is a DMA, you can temporarily disable it here:
https://github.com/pikvm/ustreamer/blob/master/src/ustreamer/m2m.c#L55
Just change true to false. If the error disappears, then we can investigate the DMA further.
In this case, as a quick dirty fix, you can try reinitializing the encoder inside us_h264_stream_process() if us_m2m_encoder_compress() returned with an error. Well, limit the number of poll iterations, for example, wait no more than 100ms. It would be great if it worked.
Also I'll try to deal with it this tomorrow myself.
Sup. I couldn't reproduce it with my setup. Perhaps this is due to some features of the HDMI source, it's strange.
Could you check what I described above?
I have not compiled the code before so it would be some time before I might have a chance to try any modifications.
However, if you are able to increase debug logging and implement the cumulative timeout behavior I can test it for you.
If I had debug symbols I might be able to trace the packets sent to the encoder. Are those published as a package somewhere?
I didn't pack debug symbols. You can just compile only ustreamer binary and install it over system package. Create config.mk in the project root:
WITH_SYSTEMD=1
WITH_GPIO=1
PREFIX=/usr
and run make all install.
The issue is not DMA related, it reproduces even with DMA disabled:
INFO --- => -- DEBUG [2113.770 stream] -- H264: Compressing new frame; force_key=0 ...
INFO --- => -- DEBUG [2113.770 stream] -- H264: Grabbing INPUT buffer ...
INFO --- => -- DEBUG [2113.770 stream] -- H264: Grabbed INPUT buffer=0
INFO --- => -- DEBUG [2113.783 stream] -- H264: Sending (releasing) INPUT buffer size 4147200 ...
INFO --- => -- DEBUG [2113.783 stream] -- H264: Polling encoder fd 10 ...
INFO --- => -- DEBUG [2114.784 stream] -- H264: Polling encoder fd 10 ...
Not sure if this is useful, but this is the contents of the last buffer submitted to the encoder before it gets snowcrashed. badframe.zip
I don't really know what I'm doing but I tried using VIDIOC_QUERYBUF to check on the status of the buffers when we get stuck waiting for a response. The output is 0x4003 = V4L2_BUF_FLAG_TIMESTAMP_COPY | V4L2_BUF_FLAG_QUEUED | V4L2_BUF_FLAG_MAPPED.
I'll make a branch and try to reinitialize the encoder.
Please check the test branch
Sup?
I ran a cycle of 3 tests with the test branch and confirmed that when the device suspends, shortly after PiKVM shows a "No Signal" message. The video stream restarts when I wake the device.
That said, I only saw "The encoder wait is too long" one time in the logs?
Maybe it doesn't always get stuck on encoding?
Please test more, if this fix helps, I will add it to the upstream in its current form.
@rgov Sup?
I added a timeout to the new release. Most likely, the problem is due to a defect in the GPU silicon, I have seen various glitches related to this.