
Supporting h264/WebRTC for separate capture devices?

Open aggieNick02 opened this issue 3 years ago • 67 comments

I was interested in working on and contributing a PR to allow the new WebRTC functionality to be used in Pi-KVM when the video capture and encoding is occurring on another machine. If this sounds ok, just let me know. You could either assign the issue to me or close it and I'll come back when I have the PR ready. Here are the details:

I use Pi-KVM in a couple ways. The first is the fairly standard setup with capture via CSI device. The second is a bit different. I use the Pi and kvmd to do the keyboard and mouse control, but have a custom capture server running on a different machine (allows for capture at higher resolutions and framerates than CSI can do).

For mjpeg, all I had to do was override the streamer config to:

    streamer:
        unix: ''
        host: {{ capture_host_ip }}
        port: {{ capture_host_ustreamer_port }}
        cmd:
            - "/bin/sleep"
            - "infinity"

and then make sure my capture host implemented the same web API as ustreamer. The modularity of your system is nice and being able to have capture on one machine and keyboard/mouse on another could provide some nice upgrade paths in the future where the device doing capture would not need to also have USB OTG.
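Concretely, the surface I had to mirror is small; as far as I can tell it's roughly these ustreamer endpoints (names from ustreamer's docs; double-check against your version):

```
GET /state     # JSON with encoder/source/stream status
GET /snapshot  # single JPEG frame
GET /stream    # multipart/x-mixed-replace mjpeg stream
```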

Anyways, with the new h264 webrtc support, doing this isn't as easy. I was able to prototype getting it working... the basic steps are:

  1. Build janus with the streaming plugin enabled (https://janus.conf.meetecho.com/docs/streaming.html)
  2. Replace the jcfg file for ustreamer with a basic jcfg file for a simple janus streamer that takes an rtp stream on a port
  3. Modify stream.js to use the janus streaming plugin (and associated id) instead of the janus ustreamer plugin
  4. Have the capture device transmit an rtp stream to the given port on the kvmd/janus device

Only step 3 seems a bit hacky. I'm imagining the name of the janus plugin and other data needed could be placed in the config file, with defaults for the normal ustreamer case; overriding them would cause stream.js to behave differently as needed. Thoughts?
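For step 2, the mountpoint jcfg would be something along these lines (the mountpoint name, id, port, and fmtp string here are illustrative placeholders, not exact values from my setup):

```
# janus.plugin.streaming.jcfg (illustrative)
h264-capture: {
    type = "rtp"
    id = 1
    description = "External H.264 capture"
    audio = false
    video = true
    videoport = 5004
    videopt = 96
    videocodec = "h264"
    videofmtp = "profile-level-id=42e01f;packetization-mode=1"
}
```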

aggieNick02 avatar Jun 18 '21 16:06 aggieNick02

I would prefer that you make something like a small daemon for the ustreamer plugin to feed it a video stream. Thus, you will only need to replace the ustreamer startup command with your daemon, and you will not have to rewrite anything else. This will be easy enough to do if you use my C code to work with shared memory.

mdevaev avatar Jun 18 '21 16:06 mdevaev

Ah, I think I understand. So the ustreamer janus plugin would stay as is, and my daemon would fill the memory sink instead of ustreamer writing to the sink?

I think I can do that with my current case pretty easily. Had a couple questions to try to understand better the benefits of going through the ustreamer plugin.

Are there other benefits, besides minimal code modification, to passing the frames through the ustreamer plugin vs the janus streaming plugin? There are actually some devices out there that do all the encoding and rtp wrapping in hardware and ship the result to an rtp endpoint. (Examples include cards like the Matrox Mura IPX encode line, which actually have a network port on them.) In that case the pi daemon would have to unwrap the rtp to get the raw frames to hand off to the ustreamer plugin, which would then wrap them back up.

Even in my case, where my capture machine is holding the encoded frames in memory, it may make sense to wrap them in rtp for transmission to the pi-kvm (versus shipping the frames with a simple custom protocol). If I did use rtp to ship to the Pi-KVM, it feels a little weird to have a small daemon whose job is to receive and parse an rtp stream just to get the frames to hand to the ustreamer plugin, which builds a new rtp stream right back up.

Thanks for helping me understand the best way to accomplish this.

aggieNick02 avatar Jun 18 '21 17:06 aggieNick02

It seems to me that the minimal footprint is quite a weighty argument, especially considering that the janus streaming plugin is a huge combine of many thousands of lines of code. By using it, you will first increase the entropy inside the kvmd code needed to support it. Second, you will have to deal with its bugs, while the ustreamer plugin is well tested.

As for RTP, the repackaging operation is cheap enough to be ignored; for gstreamer, for example, it is typical of many pipelines. And if you make a custom protocol (for example, transmitting frames over tcp and then putting them in shared memory), everything will be even simpler.

mdevaev avatar Jun 18 '21 18:06 mdevaev

Those are both good and fair points. I turned to the streaming plugin and sending RTP after thinking a little bit about the shared memory approach just because it was quicker for me to implement - I'd read the shared memory code but hadn't followed it 100%. I had a quick question about the shared memory approach to make sure I do it right from the server side.

The shared memory holds space for a single frame. Is there anything that stops the server (ustreamer) from writing faster than the client (ustreamer plugin) reads, which could lead to frames being missed on the plugin side? I didn't see anything like that present - perhaps with the capture and encoding happening on the pi, it just doesn't ever happen.

But I could see the timing characteristics being different when frames are coming over the network, with multiple video frames arriving in a burst and my little daemon needing to do something to make sure a frame, especially a key frame, is not overwritten in shared memory before it is read by the janus ustreamer plugin. Does that make sense? Sorry if there is protection against this or a reason it isn't a problem and I missed it.

aggieNick02 avatar Jun 18 '21 22:06 aggieNick02

Indeed, there is no protection against this and nothing can be done here without organizing a full-fledged queue. I just made the polling interval small enough that it wouldn't cause any problems. Practice shows that this works quite well.

mdevaev avatar Jun 18 '21 23:06 mdevaev

Sounds good, thank you.

aggieNick02 avatar Jun 18 '21 23:06 aggieNick02

I advise you to build your application on top of ustreamer's libs (logging, memsink, etc.); then you can reuse a significant part of the code.

mdevaev avatar Jun 18 '21 23:06 mdevaev

Thanks for the advice - I think I should be able to write a sample proof-of-concept pretty quickly just building against the ustreamer code as all I then really have to write is receiving the frames. I'll share once I have something running.

I meant to ask, I noticed the python files for the memory sink. Did you create a python wrapper around that, or was it just something you were playing with?

aggieNick02 avatar Jun 18 '21 23:06 aggieNick02

This is the client side (i.e. a read-only wrapper) used to receive frames from ustreamer for the VNC server in kvmd.

mdevaev avatar Jun 18 '21 23:06 mdevaev

Ah, ok, thanks. Had some other stuff interrupt my work with the shared memory, but expect to get back to that now.

aggieNick02 avatar Jun 24 '21 03:06 aggieNick02

Sorry I got a bit delayed on this. I finally got back around to it recently. It's been years since I had to write strict C, and I'd never used libevent before, so it was fun to get into.

I settled on using libevent and its chunked functionality to ingest an "mh264" stream from my capture machine. It's basically the same as mjpeg but with h264 frames instead of jpegs. We can all wish browsers actually supported that. :-)
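For context on the framing: it mirrors how mjpeg streams are typically served, so a response looks roughly like this (the content type and boundary name here are my own invention; there is no standard "mh264" type, and in my case the bytes travel over chunked transfer):

```
HTTP/1.1 200 OK
Content-Type: multipart/x-mixed-replace; boundary=frame

--frame
Content-Type: video/h264
Content-Length: <n>

<one h264 access unit>
--frame
...
```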

Right now I'm dealing with an issue where the picture is coming through but garbled. Usually it will have a large horizontal green rectangle covering 1/3 of the image. It's probably on my end, but thought I'd mention in case you hit similar while you were developing. If I use the janus streaming plugin and send the frames over rtp by using the ffmpeg API, everything works right. But the same frames sent over http and then memsink to the janus ustreamer plugin are having the issue.

I logged checksums before the frames are sent over http on the capture machine and before they are sent to the memsink, so it's not a transport corruption. Digging into _rtp_process_nalu, etc at the moment in case there is something about my h264 encoding that is somehow not compatible with it.

aggieNick02 avatar Jul 16 '21 19:07 aggieNick02

This sounds strange, because I don't have any similar problems. I use a similar transport for VNC and don't see anything like this.

mdevaev avatar Jul 16 '21 19:07 mdevaev

Thanks for the quick response. That is useful to know; my best guess is incompatibility between the encoder output (generated by nvenc on an nVidia gpu) and the plugin, if that makes sense. But maybe I've done something else wrong. I'll keep playing on my end.

Just as a sanity check, all I'm doing is the following. (The capture is 4K, but frame width/height/stride are not used by the ustreamer plugin, so I don't think they actually matter; I also don't yet bother marking keyframes in the frame struct, which afaict is a performance optimization used with memsink_server_check.)

memsink_s *sink = memsink_init("H264", "kvmd::ustreamer::h264", true, 0660, false, 10, 1);
uint8_t *buffer = malloc(CFG_MEMSINK_MAX_DATA);
int content_length = 0;

frame_s *frame = frame_init();
frame->format = V4L2_PIX_FMT_H264;
frame->width = 3840;
frame->height = 2160;
frame->stride = 3840 * 3;

while (true) {
    // Read the next frame into buffer, setting content_length
    frame_set_data(frame, buffer, content_length);
    memsink_server_put(sink, frame);
}

aggieNick02 avatar Jul 16 '21 20:07 aggieNick02

I don't see any problems in this code :thinking:

mdevaev avatar Jul 16 '21 22:07 mdevaev

Thanks for checking it so quickly for something I might have missed. I'm hoping to get to look a little more tomorrow but then will be away from work for a week. Either way I'll keep digging.

Probably the best way forward for me is to write out a raw h264 file that I can play back from the capture machine (right now every run is with a real capture), and then verify that the exact same bitstream is sent to both the ffmpeg rtp stream and the memsink source stream in the janus ustreamer plugin. If that checks out, I'll dig into any differences in how the rtp processors deal with the bitstream. I'll update when I have more.

aggieNick02 avatar Jul 17 '21 04:07 aggieNick02

Maybe your h264 uses a different profile? I used baseline.

mdevaev avatar Jul 18 '21 12:07 mdevaev

That was a good thought. The nvidia encoder uses a "default" profile by default (doesn't say what that actually is). I forced it to baseline but still have the same result. I'm finally back after my break so digging again.

aggieNick02 avatar Jul 29 '21 20:07 aggieNick02

I figured out the problem and now know more about rtp :-). I looked at the difference between the output of ffmpeg's rtp code and ustreamer's. They use different packet sizes by default, and ffmpeg employs STAP-A to combine smaller NALs (like types 7 and 8), which introduces some noise in the diffs but is not the cause of the issue. Forcing ffmpeg to use the same packet size eliminated most of the noise.

The issue is the marker bit in the rtp header. It should be set only on the very last rtp packet of the access unit, but rtp.c sets it on the last rtp packet of every type 1-5 NAL. I'm encoding 4k video, which ends up as one NAL type 7, one NAL type 8, and four NAL type 5s for a single frame. I'm guessing (and will verify shortly) that most or all of the time, the h264 generated by capture/encoding on the Pi has only a single NAL type 5, which is why the issue isn't seen there.

I can send a PR or let you make the change if you prefer. I think all that needs to happen is eliminating the local bool marked, along with setting in the switch, in _rtp_process_nalu, replacing it with a parameter to _rtp_process_nalu. Then in rtp_wrap_h264, the new parameter should be passed as false to the call in the while loop, and true to the call at the end of the function.

aggieNick02 avatar Aug 03 '21 18:08 aggieNick02

Show the patch here please

mdevaev avatar Aug 03 '21 18:08 mdevaev

diff --git a/pikvm/ustreamer/janus/src/rtp.c b/pikvm/ustreamer/janus/src/rtp.c
index 90c06526..acc2a8ec 100644
--- a/pikvm/ustreamer/janus/src/rtp.c
+++ b/pikvm/ustreamer/janus/src/rtp.c
@@ -30,7 +30,7 @@
 #define PRE 3 // Annex B prefix length


-void _rtp_process_nalu(rtp_s *rtp, const uint8_t *data, size_t size, uint32_t pts, rtp_callback_f callback);
+void _rtp_process_nalu(rtp_s *rtp, const uint8_t *data, size_t size, uint32_t pts, bool marked, rtp_callback_f callback);
 static void _rtp_write_header(rtp_s *rtp, uint32_t pts, bool marked);

 static ssize_t _find_annexb(const uint8_t *data, size_t size);
@@ -118,7 +118,7 @@ void rtp_wrap_h264(rtp_s *rtp, const frame_s *frame, rtp_callback_f callback) {
                        if (data[size - 1] == 0) { // Check for extra 00
                                --size;
                        }
-                       _rtp_process_nalu(rtp, data, size, pts, callback);
+                       _rtp_process_nalu(rtp, data, size, pts, false, callback);
                }

                last_offset = offset;
@@ -127,18 +127,16 @@ void rtp_wrap_h264(rtp_s *rtp, const frame_s *frame, rtp_callback_f callback) {
        if (last_offset >= 0) {
                const uint8_t *data = frame->data + last_offset + PRE;
                size_t size = frame->used - last_offset - PRE;
-               _rtp_process_nalu(rtp, data, size, pts, callback);
+               _rtp_process_nalu(rtp, data, size, pts, true, callback);
        }
 }

-void _rtp_process_nalu(rtp_s *rtp, const uint8_t *data, size_t size, uint32_t pts, rtp_callback_f callback) {
+void _rtp_process_nalu(rtp_s *rtp, const uint8_t *data, size_t size, uint32_t pts, bool marked, rtp_callback_f callback) {
        const unsigned ref_idc = (data[0] >> 5) & 3;
        const unsigned type = data[0] & 0x1F;

-       bool marked = false; // M=1 if this is an access unit
        frame_s *ps = NULL;
        switch (type) {
-               case 1 ... 5: marked = true; break;
                case 7: ps = rtp->sps; break;
                case 8: ps = rtp->pps; break;
        }

aggieNick02 avatar Aug 03 '21 18:08 aggieNick02

Aha, I see. I don't know too much about NAL and took this code from another project. If you can be 100 percent sure that this will not break anything and test it (including with raspberry), then I will accept the patch. I will also test it, of course.

mdevaev avatar Aug 03 '21 19:08 mdevaev

Ah, ok. I can't ever be 100%, but I'll do the testing and validating to get to 99.999% :-) . I'm working on testing the normal capture setup (and verifying that those frames indeed only have one type 5 NAL) and I'll post back when that's done. Based on reading the spec, I'm pretty confident this is the way it should be. Verifying that there is only a single type 5 NAL will actually mean there's no difference in output bits when the pi is the encoder, which should give us as close to 100% confidence as possible that nothing breaks.

I also don't know alot about NAL (just what I learned to debug this), but having the ffmpeg source and behavior was really valuable. Also the relevant bit from the spec is here (search for marker bit): https://datatracker.ietf.org/doc/html/rfc6184#page-10: "Marker bit (M): 1 bit Set for the very last packet of the access unit indicated by the RTP timestamp," An access unit is a set of NALs that contain a primary coded picture - e.g., a video frame

aggieNick02 avatar Aug 03 '21 20:08 aggieNick02

The author of the original code said that this is really a bug:

My code is designed for encoder output that isn't producing strips but whole frames at a time, so probably it's just a bug.

Okay, please send PR, I'll test it and apply tomorrow.

mdevaev avatar Aug 03 '21 21:08 mdevaev

Created the PR here: https://github.com/pikvm/ustreamer/pull/119

I had some issues getting set up to test that I just resolved, but am out of time for the day. So I haven't yet validated regular capture as planned, or verified that regular pi capture has only 1 type 1-5 NALU. That's first on my list for the morning, and I'll post as soon as that's done.

Thanks!

aggieNick02 avatar Aug 04 '21 04:08 aggieNick02

Was able to test the fix running with standard pi capture. Everything still works as expected, and I verified the pi generated encoding indeed only has 1 type 1-5 NALU per frame, even when the captured source is the max 1080p with a looping video playing and the h264 bitrate is turned up to the maximum. So this bug seems like it would never have affected the standard use case.

Just let me know if you'd like anything else tested. Thanks!

aggieNick02 avatar Aug 04 '21 17:08 aggieNick02

Thanks! I will test it myself today and merge.

mdevaev avatar Aug 04 '21 21:08 mdevaev

I was just thinking. What happens if the last packet is the one that was processed in the loop, and the condition after the loop is not met? Then marked will be false for the last packet in the series. It seems that this would cause a new bug.

https://github.com/pikvm/ustreamer/blob/master/janus/src/rtp.c#L121

mdevaev avatar Aug 04 '21 22:08 mdevaev

I thought that at first too, but that situation can't actually happen. I imagine the code could be written a little bit differently to make that more clear. The gist is that the loop and subsequent if are structured so that the last packet is always processed by the if on line 127. The only reason for that if is to handle the degenerate case gracefully when there are no NALUs at all.

Here's a more careful breakdown:

The while loop iterates until it hits the break on line 111 when offset < 0. There are two possibilities when that happens:

  1. This was actually the first iteration of the while loop. In this case, last_offset = -PRE, there are no NALUs at all, and _rtp_process_nalu is never called anywhere in the rtp_wrap_h264
  2. This is not the first iteration of the while loop. Therefore line 124 has been executed at least once, setting last_offset to offset, which is >= 0 since it was >= 0 after the check on line 110 and then had a non-negative number added to it on line 113. Since line 124 is the only place that sets last_offset other than its initialization, when the loop exits after more than one iteration, last_offset is non-negative and _rtp_process_nalu will be called with marked = true on line 130.

aggieNick02 avatar Aug 05 '21 00:08 aggieNick02

Thanks, it sounds convincing. I also looked at it again. If you want, you can try to refactor this function. I'm open to it.

mdevaev avatar Aug 05 '21 04:08 mdevaev

Haha, sure. I'll give it a go. The ffmpeg code for the same thing does it more naturally, but it also has some "fun" code too that isn't obvious until you figure out what it is doing. It's very sharp and efficient C code, but takes some staring to understand. Like this code that skips the NALU start bytes in a buffer: while (!*(r++)); (NALU start bytes can be 00 00 01 or 00 00 00 01, so if you know you are on the beginning of start bytes, it's a nice compact way to advance past them. But I stared for a few minutes without a comment before I realized.) And then the code that searches for the next start code is super-optimized to be able to generally advance 4 bytes at a time, regardless of endianness, doing 3 boolean operations and a subtraction on the integer formed by the 4 bytes - it's something else. :-)

But yeah, I'll see what I can do with the function and send a PR likely today.

aggieNick02 avatar Aug 05 '21 04:08 aggieNick02