
[REQUEST]: Improve input latency for streaming from Unity camera to end users

Open doctorpangloss opened this issue 2 years ago • 18 comments

Is your feature request related to a problem?

This is an approach to improve the latency with the current architecture of the plugin.

The goal is to get the frame encoded within 7ms of finishing rendering (https://parsec.app/blog/nvidia-nvenc-outperforms-amd-vce-on-h-264-encoding-latency-in-parsec-co-op-sessions-713b9e1e048a), which is pretty close to NvFBC / Moonlight.

Right now, due to the architecture of the plugin, the time between rendering finishing and encoding the frame is about:

  • 7ms for NVIDIA to encode
  • 4-10ms to copy resources
  • 0-20ms due to frame rate control
  • 0-24ms due to WaitForEndOfFrame

totalling about 11-61ms of input latency when networking RTT is 0.

Describe the solution you'd like

  • the plugin event call should be implemented as a custom pass in HDRP, a render feature in URP, and the current method (end of frame) for the built-in (legacy) pipeline
  • Input System should queue events directly from the data channel's thread instead of using WebRTC.Sync (see the sketch after this list). This will require a fix for Input System's memory leak with its input event buffers (https://forum.unity.com/threads/bug-memory-leaks-crash-when-queueing-events-from-threads-other-than-the-main-thread.1329837/)
  • video frame rate control should be turned off
  • avoiding copying:
    • the render texture passed to WebRTC must be retained on the C# side until WebRTC reports it has finished encoding it; alternatively, the plugin can give Unity the render texture it should blit into. Resizing is essential, and the C# side knows the size earlier than the plugin does, so the former is going to be easier to do.
    • GPU memory buffers should be created with a reference to the ITexture2D*, and should create the appropriate "texture view" for the hardware/software encoder on demand inside the encoder queue thread. In the current implementation this means handle() does the map step and ToI420 gets the CPU texture. A map looks just as expensive as a CopyResourceNativeV, though...
    • to avoid the expensive map on Windows, there must be DX11 and DX12 NVEncoderImpl support, instead of using the CUDA encoder for all platforms
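
A minimal sketch of the Input System point above, assuming the linked memory leak is fixed; the wire format and the Deserialize helper are hypothetical and not part of com.unity.webrtc or the Input System:

using System;
using Unity.WebRTC;
using UnityEngine;
using UnityEngine.InputSystem;
using UnityEngine.InputSystem.LowLevel;

// Sketch only: queues remote input straight from the data channel's worker
// thread, skipping the WebRTC.Sync hop. This is exactly the path that
// currently trips the Input System memory leak linked above.
static class RemoteInput
{
    public static void Attach(RTCDataChannel channel, Mouse mouse)
    {
        channel.OnMessage = bytes =>
        {
            // Runs on the data channel's thread, not the main thread.
            InputSystem.QueueStateEvent(mouse, Deserialize(bytes));
        };
    }

    // Hypothetical wire format: two little-endian floats (x, y).
    static MouseState Deserialize(byte[] bytes) => new MouseState
    {
        position = new Vector2(BitConverter.ToSingle(bytes, 0),
                               BitConverter.ToSingle(bytes, 4)),
    };
}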

Describe alternatives you've considered

No response

Additional context

No response

doctorpangloss avatar Sep 01 '22 16:09 doctorpangloss

Some other architectural ideas:

  • (+2ms by avoiding creating readback textures) When an encoder like NvEncoder initializes, it can add its FrameBufferFactory (CPUReadbackFrameBufferFactory, CUDABufferFactory, DX11TextureFactory, DX12TextureFactory, MTLMemorylessBufferFactory, MediaCodecBufferFactory) via a FrameBufferPool<IFrameBufferFactory> to a webrtc static. The plugin event will create frame buffers like it does now, delegated to the correct factory for buffers/wrappers, and correctly split the work across the rendering thread, the encoder thread, and the GPU without needing to see too many of the details
  • (+2-4ms by avoiding CPU waiting on GPU) The plugin can supply a GPU fence that is signaled when the GPU finishes blitting the rendered frame into the pool of frames (or finishes rendering, when the plugin retains frames itself), and the encoder thread can wait until the frame is ready via the fence. This way, wherever you decide to do the copy, if you need one at all, it is done on the GPU queue via blitting (extremely fast, about 0.02ms in my measurements) instead of with long sync/flush/CPU wait times. A Unity-side sketch follows this list.
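
A Unity-side sketch of the fence idea, assuming the plugin could accept a GraphicsFence (today it cannot): the copy is recorded as a blit on the GPU queue, and a fence created right after it lets the encoder thread wait for the frame instead of doing a CPU-side sync/flush.

using UnityEngine;
using UnityEngine.Rendering;

// Sketch: record the copy as a blit and return a fence that signals once the
// GPU has finished writing pooledFrame. Handing the fence to the native
// encoder thread is the part the plugin would have to add.
static class FencedCapture
{
    public static GraphicsFence BlitWithFence(CommandBuffer cmd, RenderTexture source, RenderTexture pooledFrame)
    {
        cmd.Blit(source, pooledFrame);          // ~0.02ms on the GPU queue
        return cmd.CreateAsyncGraphicsFence();  // signals after the blit executes on the GPU
    }
}

The encoder side could then poll GraphicsFence.passed (or the equivalent native fence) before mapping the frame, rather than flushing the whole queue.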

doctorpangloss avatar Sep 05 '22 16:09 doctorpangloss

@doctorpangloss Thanks for your suggestions; we have added tasks to investigate the improvements you proposed.

  • Improve the timing of the plugin event for HDRP and URP (WRS-402)
  • Add "Framerate synchronization mode" for capturing video frames (WRS-354)
  • Receive data from RTCDataChannel on the worker thread instead of the main thread (WRS-404)
  • Use an external texture to retain the GPU buffer during initialization on the C# side instead of the native side (WRS-405)
  • Use GPU Fence to reduce CPU wait time (WRS-342)

karasusan avatar Sep 06 '22 05:09 karasusan

Based on some further research:

  1. In URP and HDRP, use the same event / pass that the Unity Recorder does to call a new render event "TextureReady" with a GPU fence
  2. The video frame scheduler awaits this fence before encoding the frame.

In principle, the video frame scheduler is already 90% of the way there for "framerate synchronization mode." The hard part for me is how to get the DX12Fence out of the command buffer's fence structure on the Unity side.
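
A managed-side sketch of step 2, assuming the fence from step 1 is available as a GraphicsFence; extracting the underlying DX12 fence for the native scheduler remains the open question above. pluginEventFunc stands in for whatever render-event callback the plugin exposes.

using System;
using UnityEngine.Rendering;

// Sketch: make the command buffer that triggers the encode wait on the
// "TextureReady" fence on the GPU timeline before issuing the plugin event.
static class TextureReadyEncode
{
    public static void EncodeAfter(CommandBuffer encodeCmd, GraphicsFence textureReady, IntPtr pluginEventFunc)
    {
        encodeCmd.WaitOnAsyncGraphicsFence(textureReady);
        encodeCmd.IssuePluginEvent(pluginEventFunc, 0);
    }
}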

doctorpangloss avatar Nov 01 '22 21:11 doctorpangloss

@doctorpangloss Sorry, but I couldn't find the event "TextureReady" in the Graphics repository. Where did you find it?

karasusan avatar Nov 03 '22 08:11 karasusan

Hi

Is there a timeline for fixes for these issues? I'm using the UnityRenderStreaming package, and in my own tests on an extremely fast local network, streaming video from a PC to a Quest 2 at 1920x1080, 60fps, there is ~120ms of latency.

When compared to other applications that do streaming from PC to Quest 2 (e.g. Virtual Desktop), under the same circumstances, it's an increase of ~80ms.

I've tried all video codecs and the results were very similar.

Thanks.

Joao-RI avatar Feb 21 '23 13:02 Joao-RI

@Joao-RI We already track this issue here: #838. The video codec might be the main reason for the latency.

karasusan avatar Feb 28 '23 02:02 karasusan

@karasusan Thanks for the reply.

We checked again with UnityRenderStreaming 3.1.0-exp.6, between PC and Quest 2 and between PC and PC, and the results are similar: around 90ms of latency.

There seems to have been an improvement between versions, but the results between platforms using H264 were identical. This tells me that either the decoder is also not using hardware acceleration on PC, or there is something else eating up time, as suggested by @doctorpangloss.

VP9 and AV1 performed slightly better on Quest 2 with static screens (we didn't test these on PC), but the framerate felt unstable once there was a lot going on.

Joao-RI avatar Mar 01 '23 14:03 Joao-RI

@Joao-RI As you say, Quest 2 should be able to use hardware acceleration, or latency could be improved as suggested by @doctorpangloss, but we cannot say when this will be resolved.

kannan-xiao4 avatar Mar 07 '23 05:03 kannan-xiao4

I'm going to take another look at this issue. For my purposes, focusing on just DX11 and DX12 support is ideal. Some background:

  • Unity isn't a Linux engine.
    • With the death of Stadia, Vulkan / Linux streaming players are essentially an academic curiosity. Sorry. Even the TensorWorks guys have accepted this into their hearts.
    • The Unity editor on Linux can't build any projects I've tried correctly.
  • DX12 is the only graphics framework that explicitly has low-latency streaming in mind, because Microsoft uses it for xCloud.
  • HMDs are a mess.
  • The poor latency is the biggest obstacle to innovative uses of streaming. Not encoded transforms, not support for iOS, etc. etc. If it doesn't perform well, people will choose Unreal Pixel Streaming, even if the rest of that ecosystem is vaporware.

What does the future hold? I don't know. The Mac Studio is the highest performance converged commodity platform today. It's a better architecture for hosted streaming. It will take years for converged NVIDIA, AMD & Intel high performance APUs to reach the datacenter, in a way that is compatible with graphics and/or Windows. Too much emphasis on ML. Really depends what you are excited about.

doctorpangloss avatar Apr 03 '23 16:04 doctorpangloss

@doctorpangloss Thanks for sharing your opinion.

The Unity editor on Linux can't build any projects I've tried correctly.

I am curious about this line; what do you mean?

karasusan avatar Apr 04 '23 06:04 karasusan

The Unity editor on Linux can't build any projects I've tried correctly.

I am curious about this line; what do you mean?

We have experimented with building players that run in Linux and Windows containers. A Windows or Linux standalone player built with the Linux headless editor always has flaws in any large or complex project we tried. For example, the pivot point of an animated FBX character would be in the center of the animated person when built by the Linux headless editor, versus the correct position with the Windows headless editor. Or various textures would be blank with (Linux headless editor, Windows standalone target) but correct with (Windows headless editor, Windows standalone target).

doctorpangloss avatar Apr 04 '23 16:04 doctorpangloss

~~It might be sufficient to create a FenceScheduler instead of a VideoFrameScheduler, and pass a fence and value from Unity in a custom pass / render feature when the frame has been blitted to the render texture (i.e. rendering has finished).~~

doctorpangloss avatar Apr 11 '23 17:04 doctorpangloss

Simply bypassing the video frame scheduler eliminates a significant amount of latency. ~~What I do not comprehend is why this does not work in Standalone players. Do you have any insight as to what is different between the editor and standalone when it comes to render events?~~

doctorpangloss avatar Apr 12 '23 00:04 doctorpangloss

What I do not comprehend is why this does not work in Standalone players.

My understanding is that we need to skip the video frame scheduler, which maintains the streaming framerate, in order to improve latency. Didn't that resolve the performance issue on the standalone player?

karasusan avatar Apr 12 '23 05:04 karasusan

Didn't that resolve the performance issue on the standalone player?

It does; I discovered my mistake. In my environment, the editor doesn't use a TURN relay, but standalone does :) Once I realized my error, I could see that significant latency improvements occurred.

Additionally, in SRP pipelines you should use RenderPipelineManager.endCameraRendering and the SRP context to schedule the blit and plugin event, because WaitForEndOfFrame() in the editor waits for the editor player loop and target framerate, so latency appears to be much higher, though not because of the plugin architecture. This is why, when I tried to remove the video frame scheduler earlier this year, I saw no improvement.
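
Roughly, for SRP (a sketch, assuming the same m_source and Encode(CommandBuffer) helper as in the built-in snippet below, which come from my modified plugin rather than the released package):

using UnityEngine;
using UnityEngine.Rendering;
using Unity.WebRTC; // assumes a build exposing VideoTrackSource.Encode(CommandBuffer)

// Sketch: record the blits + plugin event as soon as this camera finishes
// rendering, instead of waiting for WaitForEndOfFrame.
class SrpEncodeHook : MonoBehaviour
{
    [SerializeField] Camera m_camera;
    object m_source; // a VideoTrackSource, as in the built-in snippet below

    void OnEnable()  => RenderPipelineManager.endCameraRendering += OnEndCameraRendering;
    void OnDisable() => RenderPipelineManager.endCameraRendering -= OnEndCameraRendering;

    void OnEndCameraRendering(ScriptableRenderContext context, Camera cam)
    {
        if (cam != m_camera)
            return;
        var cmd = CommandBufferPool.Get("WebRTC Encode");
        ((VideoTrackSource)m_source).Encode(cmd);  // records blits + plugin event
        context.ExecuteCommandBuffer(cmd);         // flushed by the pipeline's own Submit()
        CommandBufferPool.Release(cmd);
    }
}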

On the built-in pipeline, use this in LateUpdate:

private CommandBuffer buffer = new();
...
// Encode is the same as WebRTC.Encode: it records the blits and plugin event but does not execute them
((VideoTrackSource)m_source).Encode(buffer);
// remove before adding so the buffer is only attached to the camera once
camera.RemoveCommandBuffer(CameraEvent.AfterEverything, buffer);
camera.AddCommandBuffer(CameraEvent.AfterEverything, buffer);

the video frame scheduler, which maintains the streaming framerate

Users (or the plugin) should configure Application.targetFrameRate.
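
For example (vSync must be disabled for targetFrameRate to take effect in a player):

// let rendering run as fast as possible, or pin it to the desired capture rate
QualitySettings.vSyncCount = 0;
Application.targetFrameRate = 144;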

Also, I have not had luck munging the SDP for setting max framerate. Always seems to be 60. I would like 144.

The webrtc frame adapter already drops frames that are coming in too frequently. This seems acceptable to me.

Ideally you blit directly into the input resource for NvENC. We discussed this elsewhere.

I can't figure out what to do with the private pointer in the GraphicsFence struct from Unity. It is a null pointer in the render thread. Maybe you can look at the Unity source and figure it out.

Copying synchronously takes as long as 6ms sometimes. That will probably be closer to 0.02ms if you did the blitting from Unity and used a fence. I started working on this but it exceeds my C++ abilities.

doctorpangloss avatar Apr 12 '23 18:04 doctorpangloss

Framerate synchronization mode (WRS-354) https://github.com/Unity-Technologies/com.unity.webrtc/pull/950

karasusan avatar Jul 12 '23 11:07 karasusan

In the following conditions:

  • Use the plugin from dbcb9f0e3c3e2c27b08faf742a8852d8ca3dfe41
  • Issue the plugin command at the end of the SRP
  • Application.targetFrameRate = -1
  • Skip the CPU texture
  • Typical frames (1080p)

I observe 2-7ms of NvEnc work and about 1ms of overhead: 0.25ms on the render thread for copying and 0.75ms in CopyResourceNativeV. So it's pretty great!

doctorpangloss avatar Jul 19 '23 05:07 doctorpangloss

Hello, @karasusan

I am running the E2ELatency sample locally (0ms RTT) and found that the displayed average latency falls in the range of 25-60ms. (BTW, I also raised an issue about the method used to calculate latency in https://github.com/Unity-Technologies/com.unity.webrtc/issues/1025)

This was unexpected, since WebRTC is designed for real time. So I did a simple test with the DataChannel sample: I logged timestamps before sending and on receiving each message, and it shows about 0-1ms of latency, which seems reasonable. I cannot understand why video and data messages have such a big gap (latency difference).
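
The test was roughly the following (a sketch, not the exact sample code; both channels live in the same process, so one clock is shared):

using System.Diagnostics;
using Unity.WebRTC;

// Sketch of the measurement: stamp a message before Send() on one peer and
// log the elapsed time when OnMessage fires on the other peer.
static class DataChannelLatencyTest
{
    static readonly Stopwatch clock = Stopwatch.StartNew();

    public static void Run(RTCDataChannel sender, RTCDataChannel receiver)
    {
        receiver.OnMessage = bytes =>
        {
            long sentMs = long.Parse(System.Text.Encoding.UTF8.GetString(bytes));
            UnityEngine.Debug.Log($"data channel latency: {clock.ElapsedMilliseconds - sentMs} ms");
        };
        sender.Send(clock.ElapsedMilliseconds.ToString());
    }
}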

I wonder whether these values are typical on your side? If so, could anyone suggest potential solutions for time-sensitive interactive applications?

Thanks in advance!

ViGeng avatar Mar 25 '24 13:03 ViGeng