ustreamer icon indicating copy to clipboard operation
ustreamer copied to clipboard

Stuck infinitely polling H.264 encoder device

Open rgov opened this issue 2 years ago • 14 comments

I believe this is the root cause of https://github.com/pikvm/pikvm/issues/1217, or related to it.

In that issue, the PiKVM stream breaks when the display attached to the video capture device goes to sleep (for example, when the remote computer sleeps). The ustreamer process needs to be restarted to resume streaming.

When this happens, the stream thread is calling _H264_PUT() which calls _m2m_encoder_compress_raw(). This function appears to try to give data to the encoder device, then polls it for a response. However, in the bug condition, the encoder device never provides a response.

H264: Compressing new frame; force_key=0 ...
H264: Using INPUT-DMA buffer=0
H264: Sending INPUT-DMA buffer ...
H264: Polling encoder ...
H264: Polling encoder ...
H264: Polling encoder ...

https://github.com/pikvm/ustreamer/blob/2dddb879bc715805ae9e733f7bdcc78a6a23b123/src/ustreamer/m2m.c#L444-L453

The loop repeats forever until there is a POLLIN event. (There is no checking for other types of errors, but with a debugger I confirmed the revents value is 0x0000, not an error condition.)

Anyway, this infinite polling loop has no overall timeout. Thus, if for some reason the encoder doesn't give us anything back, we get stuck here forever.

Some suggestions:

  • [ ] Add an overall timeout for the cumulative time spent polling, and goto error. This would kick the error back to us_h264_stream_process() but I'm not sure what the error handling is (e.g., does it reset the encoder device).

    Worth considering if hitting a 1000ms timeout even once from the encoder is already strong evidence that it's not working, considering how quickly we need the encoded frame.

  • [ ] Add some temporary tracing to see the last things sent to the encoder so we can create a reproducible test case about why it doesn't respond.

  • [ ] Move the H.264 encoder out of the stream thread to a dedicated thread, so if it dies, at least the JPEG worker threads can continue to run.

  • [ ] Handle events returned from poll() other than POLLIN.

# ustreamer --version
5.48

rgov avatar Jan 18 '24 02:01 rgov

Wow. Thanks for the investigation. I believe that the reason lies in the fact that the H264 buffer is transmitted via DMA directly from HDMI capture. To confirm that this is a DMA, you can temporarily disable it here:

https://github.com/pikvm/ustreamer/blob/master/src/ustreamer/m2m.c#L55

Just change true to false. If the error disappears, then we can investigate the DMA further.

In this case, as a quick dirty fix, you can try reinitializing the encoder inside us_h264_stream_process() if us_m2m_encoder_compress() returned with an error. Well, limit the number of poll iterations, for example, wait no more than 100ms. It would be great if it worked.

Also I'll try to deal with it this tomorrow myself.

mdevaev avatar Jan 18 '24 13:01 mdevaev

Sup. I couldn't reproduce it with my setup. Perhaps this is due to some features of the HDMI source, it's strange.

Could you check what I described above?

mdevaev avatar Jan 19 '24 23:01 mdevaev

I have not compiled the code before so it would be some time before I might have a chance to try any modifications.

However, if you are able to increase debug logging and implement the cumulative timeout behavior I can test it for you.

If I had debug symbols I might be able to trace the packets sent to the encoder. Are those published as a package somewhere?

rgov avatar Jan 20 '24 00:01 rgov

I didn't pack debug symbols. You can just compile only ustreamer binary and install it over system package. Create config.mk in the project root:

WITH_SYSTEMD=1
WITH_GPIO=1
PREFIX=/usr

and run make all install.

mdevaev avatar Jan 20 '24 01:01 mdevaev

The issue is not DMA related, it reproduces even with DMA disabled:

INFO --- => -- DEBUG [2113.770    stream] -- H264: Compressing new frame; force_key=0 ...
INFO --- => -- DEBUG [2113.770    stream] -- H264: Grabbing INPUT buffer ...
INFO --- => -- DEBUG [2113.770    stream] -- H264: Grabbed INPUT buffer=0
INFO --- => -- DEBUG [2113.783    stream] -- H264: Sending (releasing) INPUT buffer size 4147200 ...
INFO --- => -- DEBUG [2113.783    stream] -- H264: Polling encoder fd 10 ...
INFO --- => -- DEBUG [2114.784    stream] -- H264: Polling encoder fd 10 ...

rgov avatar Jan 20 '24 22:01 rgov

Not sure if this is useful, but this is the contents of the last buffer submitted to the encoder before it gets snowcrashed. badframe.zip

rgov avatar Jan 20 '24 23:01 rgov

I don't really know what I'm doing but I tried using VIDIOC_QUERYBUF to check on the status of the buffers when we get stuck waiting for a response. The output is 0x4003 = V4L2_BUF_FLAG_TIMESTAMP_COPY | V4L2_BUF_FLAG_QUEUED | V4L2_BUF_FLAG_MAPPED.

rgov avatar Jan 21 '24 00:01 rgov

I'll make a branch and try to reinitialize the encoder.

mdevaev avatar Jan 21 '24 01:01 mdevaev

Please check the test branch

mdevaev avatar Jan 21 '24 02:01 mdevaev

Sup?

mdevaev avatar Jan 28 '24 12:01 mdevaev

I ran a cycle of 3 tests with the test branch and confirmed that when the device suspends, shortly after PiKVM shows a "No Signal" message. The video stream restarts when I wake the device.

rgov avatar Jan 30 '24 21:01 rgov

That said, I only saw "The encoder wait is too long" one time in the logs?

rgov avatar Jan 30 '24 21:01 rgov

Maybe it doesn't always get stuck on encoding?

Please test more, if this fix helps, I will add it to the upstream in its current form.

mdevaev avatar Jan 31 '24 10:01 mdevaev

@rgov Sup?

mdevaev avatar Feb 05 '24 17:02 mdevaev

I added a timeout to the new release. Most likely, the problem is due to a defect in the GPU silicon, I have seen various glitches related to this.

mdevaev avatar Mar 14 '24 14:03 mdevaev