moonlight-ios icon indicating copy to clipboard operation
moonlight-ios copied to clipboard

Improve stream smoothness

Open felipejfc opened this issue 3 years ago • 6 comments

@cgutman I bought a shield so now I have both an ATV 4k 2nd gen and a Shield Pro at home. It caught my attention that without using pacing, the stream in the shield feels more smooth than on the ATV 4K, even on the same networking settings.

I then looked at the code into possible points of improvement and figured out that the way we left the rendering after the PR that added frame pacing capabilities, maybe we're doing too much stuff inside each display link tick. I then created an exclusive thread to do all the processing and leave only rendering in the display link tick.

I felt that it increased the smoothness of streaming in both scenarios, with and without frame pacing enabled.

Wdyt?

felipejfc avatar Apr 05 '22 22:04 felipejfc

The idea is interesting, but I'd like to avoid a change like this if possible since it increases complexity quite a bit.

From what I can tell, you made 2 main changes here:

  1. Split processing of the decode units between the regular decoder thread and the display link callback
  2. Started using kCMSampleAttachmentKey_DoNotDisplay for samples we know will be immediately superseded

Change 1 is what I'd like to avoid if we can. While having to iterate through the frame byte-by-byte is not ideal, I haven't really seen this being a source of performance issues when I've profiled it in Instruments. Did you get a trace showing excessive time spent there? If so, I'm curious where most of the time was spent.

Change 2 seems fine to go in on its own if you find it helps. It's pretty straightforward and I'd accept it in a PR on its own while we sort out change 1 if you want to do that.

cgutman avatar Apr 10 '22 18:04 cgutman

@cgutman PTAL #503

Regarding the change 1, I'm not sure about it, I do know for sure that if we don't split these threads, some DisplayLink callbacks take ~2ms to execute and others 3 and 4 ms(although it is rare). I'm not sure now the impact this has, I need to study DisplayLink a little more. But I do feel that shield streaming is smoother than ATV when using no buffer for some reason and I think it might have to do with our drawing strategy, assuming that frames are being reconstructed fast enough in both platforms (I didn't profile this yet).

Not sure if this is related but since we're talking optimization, I did notice that reed Solomon algorithm is the heaviest procedure we're calling during the stream and I saw that the current implementation we're using in moonlight is naive. We could leverage other more sophisticated ones to improve processing time, this one for example. WDYT?

felipejfc avatar Apr 10 '22 23:04 felipejfc

Debug builds of Moonlight (LC_DEBUG defined) enable FEC validation mode. That's the Reed-Solomon activity you see on every frame.

Release builds only use Reed-Solomon FEC recovery when a packet is lost (a very small fraction of all frames). Make sure you're testing a release build when profiling. I believe the Profile scheme in Xcode will automatically do a release build.

cgutman avatar Apr 11 '22 04:04 cgutman

I think the juddering issue is not with raw performance. You were on the right track when you added a buffer with the framePacing option. But I think we need an algorithm to utilize the buffer once it's filled. Otherwise, we're just delaying the juddering that happens when the remote and local clocks are close to exact sync (when frame output starts flapping to either side of the "local vsync pulse" due to timing jitter).

I'm experimenting with some code that steers the buffer towards retaining 1 frame. If it exceeds one frame in either direction, it drops/adds as needed, but then locks the renderer into reading only one frame from the buffer for a period of time until the flapping event passes. The results are very promising so far on my setup. I finally have been able to play a game for over an hour without a juddering fit! If you want to try it, my branch is here, no warranties.

https://github.com/djrobx/moonlight-ios/tree/pacing-buffer

Update: I have been playing 2D side scrolling games all weekend with my code change. Buttery smooth at 4k/60hz now!

djrobx avatar Aug 05 '22 22:08 djrobx

I have a wired setup with MoCA adapters (Apple TV 4K 2021) so I'm not sure how much the branch by djrobx should affect me. I built it and gave it a shot at 4k/60hz, 100Mbps and it felt smooth! Granted I never had any juddering issues with the frame pacing option, would this change help to reduce latency with the "Frame Pacing" option in any way?

Starlank avatar Aug 16 '22 03:08 Starlank

Thanks for testing it out, glad it did not make anything worse!

My modified frame pacing code aims for a 1 frame buffer, so it should neither hurt nor improve latency. It might newly introduce a 1 frame delay on systems that have a refresh rate slightly faster than the host (something like client 60.0001hz with host 60.0000). The previous code didn't have logic to deal with that scenario, the host needed issue frames faster than the client could render them in order for it to build the 1 frame buffer. That probably happens often though, with TVs defaulting to 59.97hz and PCs usually running 60hz. I set my AppleTV to 4k60hz to try and get them closer.

I haven't made any changes to my branch because it's working very well for me. I've played some really timing sensitive games and have not had any issues with latency. I'm incredibly impressed with this project!

djrobx avatar Aug 22 '22 18:08 djrobx

@djrobx did you see the PR that ended up being merged? this one, it aims to 1 frame buffer as well but with a lot less code. Not sure what you're doing differently, maybe I'm missing something...

On another subject, I wonder why the stream is smooth in nvidia shield even without this 1 frame buffer (I have both a shield and an ATV at home)

felipejfc avatar Nov 03 '22 22:11 felipejfc

Closing since this was already merged in #482

felipejfc avatar Nov 04 '22 13:11 felipejfc