Stdout pass-through rate limited by vsync?

Open mrylmz opened this issue 3 years ago • 19 comments

It seems that stdout processing is rate-limited to the render FPS. Is it expected behaviour for a terminal to render every character once?

I'm not familiar with terminals and don't know whether there is a specification that the rendering has to meet to count as "terminal"-conformant.

The results using splat on a 1gb text file containing the alphabet:

VSync ON

Size=72x22
RenderFPS=~60

Total sink time: 315.681s 0.003217gb/sec

VSync OFF

Size=72x22
RenderFPS=~5000-8000

Total sink time: 34.638s 0.029321gb/s
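
A rough sanity check on the numbers above, as a sketch rather than anything taken from splat or refterm: dividing each reported rate by the render FPS gives the average amount of data consumed per frame. The function names are made up for illustration, and the unit is whatever the benchmark means by "gb", which the output does not spell out.

```c
#include <assert.h>

/* Average data consumed per rendered frame, in the same unit as the rate
   (here "gb" as printed by the benchmark; 10^9 vs 2^30 bytes is an open
   assumption, so only ratios should be trusted). */
double data_per_frame(double rate_per_sec, double frames_per_sec)
{
    assert(frames_per_sec > 0.0);
    return rate_per_sec / frames_per_sec;
}

/* How many times faster the second run was, from the two sink times. */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

With the figures above, data_per_frame(0.003217, 60.0) comes out at roughly 0.0000536 gb per frame, and speedup(315.681, 34.638) is about 9.1. The frame rate went up by a factor of 80 or more while throughput only went up about 9x, which suggests the VSync OFF run is no longer limited purely by the number of frames.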

mrylmz avatar Jul 07 '21 07:07 mrylmz

Well - I guess I don't know if it's expected behavior or not.

But the reason it happens in refterm is because I kept it single-threaded on the render side. So when DX blocks on the vsync, nobody services the pipe. There are two ways to fix it, I'm not sure which one would be preferable:

  • Make another thread. This is the best solution, IMO, and what I would normally do, but I didn't want the code to be harder to understand.
  • Keep only one terminal thread, but if vsync is detected, keep servicing the pipe in a loop while waiting for the frame.

It might be best to go with the second option for readability? I did notice Martins had a thing called a "frame waitable" handle for DX, which I never looked at, but it may be that that would do what I need. If that's a handle that says "I'm ready for another frame now", I could just service the pipe until I see that handle go.

I might try that, because if that works, then that is probably sufficient to solve the problem but wouldn't generally be harder for people to read/understand.
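
The second option can be sketched roughly as below. This is a portable model, not refterm's code: the two callbacks are hypothetical stand-ins for a non-blocking check of the frame waitable handle and for a pipe read, and the sim_* helpers exist only so the loop can be exercised without a GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for the real platform calls. */
typedef bool   (*frame_ready_fn)(void *user); /* non-blocking "can I present?" */
typedef size_t (*drain_pipe_fn)(void *user);  /* bytes consumed, 0 if none     */

/* Keep servicing the pipe until the frame is ready; returns bytes drained.
   Note: this sketch spins when the pipe is empty. A real loop would block
   on both the pipe and the frame handle instead of polling. */
size_t service_until_frame_ready(frame_ready_fn frame_ready,
                                 drain_pipe_fn drain_pipe, void *user)
{
    size_t total = 0;
    assert(frame_ready && drain_pipe);
    while (!frame_ready(user)) {
        total += drain_pipe(user);
    }
    return total;
}

/* Simulation helpers (illustration only). */
typedef struct {
    int    polls_until_ready; /* frame becomes ready after this many checks */
    size_t pending;           /* bytes waiting in the simulated pipe        */
    size_t chunk;             /* max bytes consumed per drain call          */
} sim_state;

bool sim_frame_ready(void *user)
{
    sim_state *s = (sim_state *)user;
    if (s->polls_until_ready == 0) return true;
    s->polls_until_ready--;
    return false;
}

size_t sim_drain_pipe(void *user)
{
    sim_state *s = (sim_state *)user;
    size_t n = s->pending < s->chunk ? s->pending : s->chunk;
    s->pending -= n;
    return n;
}
```

The point of the design is that the render thread never sits idle inside the vsync wait while input is pending, without introducing a second thread.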

- Casey

cmuratori avatar Jul 07 '21 08:07 cmuratori

If that's a handle that says "I'm ready for another frame now",

That is exactly what that handle is. It is an event that will be set to signaled when Present won't block.

mmozeiko avatar Jul 07 '21 08:07 mmozeiko

That is so sweet!!!

I'm on it.

- Casey

cmuratori avatar Jul 07 '21 08:07 cmuratori

Sounds great!

Maybe it would also be possible to limit just the processing part by measuring the time spent, so that it never drops below the target frame rate. Like maintaining a maximum throughput per frame.

mrylmz avatar Jul 07 '21 08:07 mrylmz

It already can sort-of do that, by just setting the inbound pipe read size to be something that is processable quickly. I'm not sure what that would look like in practice though - do you mean like with a feedback loop where it tries to guess how much time it will take to process the inbound data based on how much time it took last time, and adjust the size?
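
One way such a feedback loop could look, as a minimal sketch and not something refterm implements: scale the next read size by the ratio of the time budget to the time the last read took to process, then clamp it. The function name, parameters, and clamping bounds are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Guess the next pipe read size from the previous measurement.
   last_size      bytes processed last time
   last_seconds   how long that processing took
   budget_seconds processing time we are willing to spend per frame
   min/max_size   clamp range for the result */
size_t next_read_size(size_t last_size, double last_seconds,
                      double budget_seconds,
                      size_t min_size, size_t max_size)
{
    double target;

    assert(min_size > 0 && min_size <= max_size);
    assert(budget_seconds > 0.0);

    if (last_size == 0 || last_seconds <= 0.0) {
        /* No usable measurement yet: start small. */
        return min_size;
    }

    /* Assume processing cost is roughly linear in the number of bytes. */
    target = (double)last_size * budget_seconds / last_seconds;

    if (target < (double)min_size) return min_size;
    if (target > (double)max_size) return max_size;
    return (size_t)target;
}
```

For example, if 65536 bytes took 4 ms against a 2 ms budget, the next read would be halved to 32768 bytes. The linear-cost assumption is the weak point: input with many escape sequences or glyph cache misses may cost more per byte than the previous chunk did.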

- Casey

cmuratori avatar Jul 07 '21 08:07 cmuratori

Yes, exactly. I'm not sure if it will be a reliable solution, but it could be a simple one compared to going multi-threaded, and it would avoid depending too much on the rendering part.

mrylmz avatar Jul 07 '21 08:07 mrylmz

I pushed a v2 just recently. Still in the works, but I added the handle check. I haven't had a chance to test it yet though, so it's probably not working yet :)

- Casey

cmuratori avatar Jul 08 '21 00:07 cmuratori

I've just tested v2 again with a 1gb text file, and the numbers look good now. Having vsync on is now twice as fast. Is the default path doing anything different here?

VSync ON Total sink time: 3.114s (0.351056gb/s)

VSync OFF Total sink time: 6.070s (0.180097gb/s)

Using fastpipe:

VSync ON Total sink time: 1.407s (0.776964gb/s)

VSync OFF Total sink time: 1.384s (0.789876gb/s)

mrylmz avatar Jul 08 '21 07:07 mrylmz

Yeah, there are some issues that will need to be addressed as I round everything up. Basically the terminal now only uses CPU/GPU when it needs to, which means it waits on handles, and that code is probably not where it should be at the moment largely because it uses CreatePipe which is kind of broken, and in general the code just hasn't been scrutinized the way it should be.

If you want to do another quick test, you can type "throttle" at the command line to turn throttling on and off. With throttling off, it will do what it used to do where it just used all available CPU and GPU. With throttling on, it will only wake up when a pipe has something coming in.

- Casey

cmuratori avatar Jul 08 '21 08:07 cmuratori

Yes, with throttling turned off we reach the same numbers as with vsync on! It is interesting to see that the same test with splat2 climbs up to 3.2gb/s. It seems like fwrite is doing more work; a quick search on the internet points to things like back buffering and being thread-safe to some degree.

The initial vsync issue is now fixed for me with your improvement. You can keep the issue open if you like as a reminder for further improvements or just create a new one as follow up addressing the problem you described.

Thanks for the great work!

mrylmz avatar Jul 08 '21 09:07 mrylmz

I think the problem currently is that nobody has really looked at the handle stuff in detail, I just did whatever was simplest. So it's kind of thrown in there and there may be inefficiency in there right now that should be avoided. Primarily, CreatePipe() seems kind of broken on Windows, so I think using CreateNamedPipe instead to manually create the anonymous pipes for the non-fast-pipe case would probably be the right way to go. It's not really my area of expertise, unfortunately - this is the first time I've ever tried using pipes on Windows :) Usually for interproc you just do shared memory or sockets, etc.

So I'll leave this open for now just so we don't forget to look at it eventually and see what's up.

- Casey

cmuratori avatar Jul 10 '21 07:07 cmuratori

That is exactly what that handle is. It is an event that will be set to signaled when Present won't block.

Just want to add a little more detail in case someone stumbles upon this ticket and misinterprets that sentence - that statement isn't entirely correct afaiu. It behaves kind of like a semaphore. It's released once when the swap chain releases one buffer back to your application, and your application is supposed to wait on it exactly once per buffer that you wish to submit via Present.

For example, if you were to wait for it multiple times in direct succession in refterm's code, it would only succeed once (the subsequent calls would time out, or hang indefinitely if no timeout is specified).

In the case of regular double buffering as used in refterm's swap chain, that boils down to exactly one successful (!) wait per call to Present (and that in turn would happen exactly once per vblank if the renderer is fast enough).

If you have more buffers, its usage is more nuanced (think video playback).
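
The semaphore-like behaviour described above can be modelled in a few lines. This is only a toy model with hypothetical names; the real object's initial and maximum counts are not stated in this thread and are not assumed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a counting semaphore (not the real DXGI object). */
typedef struct {
    int count; /* buffers currently available to the application */
} frame_semaphore;

/* Called by the simulated swap chain when it hands a buffer back. */
void frame_semaphore_release(frame_semaphore *s)
{
    s->count++;
}

/* Non-blocking wait: consumes one count on success, or fails the way a
   timed-out wait would, leaving the count untouched. */
bool frame_semaphore_try_wait(frame_semaphore *s)
{
    assert(s->count >= 0);
    if (s->count == 0) return false;
    s->count--;
    return true;
}
```

After a single release, the first frame_semaphore_try_wait succeeds and a second one in direct succession fails, which matches the "only succeeds once" behaviour described for repeated waits.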

bplu4t2f avatar Jul 17 '21 14:07 bplu4t2f

If that's the case, it may be used incorrectly in refterm:

https://github.com/cmuratori/refterm/blob/91e932f011e12c02a6c609ac59570f5c19fe4727/refterm_example_terminal.c#L1324

I used it "as if" it was as Mārtiņš said, meaning that it's just a handle I can check to see if it has been signaled or not. This should probably be fixed?

- Casey

cmuratori avatar Jul 17 '21 18:07 cmuratori

You break out of the loop on a successful (!) wait, and then you don't wait again until the Present call as far as I can tell. If the WaitForSingleObject call times out, it doesn't actually "take" the semaphore.

Edit: You can easily see what's going to happen by just putting another WaitForSingleObject call after the do-while loop, with a timeout of, say, 2 seconds. You'll notice it'll hang. Assuming throttling is off, of course -- whether your Present call has a non-zero SwapInterval (for vsync) doesn't seem to make a difference.

bplu4t2f avatar Jul 17 '21 18:07 bplu4t2f

It is an auto-reset event, right? If so, then this is normal behavior for such an event. It is set back to the non-signaled state once WaitForSingleObject succeeds on it. For repeated checks, an extra boolean would probably be needed.
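
The "extra boolean" idea might look something like this sketch: remember a successful wait in a flag so that later checks do not try to consume the signal a second time, and clear the flag once the frame has been presented. All names here are hypothetical, and the sim_event helper only imitates an object whose signal is consumed by a successful wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-blocking wait; true means the signal was consumed. */
typedef bool (*try_wait_fn)(void *user);

typedef struct {
    bool ready; /* a wait already succeeded for the upcoming frame */
} frame_latch;

/* Safe to call repeatedly: waits at most once per frame. */
bool frame_latch_poll(frame_latch *latch, try_wait_fn try_wait, void *user)
{
    assert(latch && try_wait);
    if (!latch->ready && try_wait(user)) {
        latch->ready = true;
    }
    return latch->ready;
}

/* Call right after presenting the frame the latch was guarding. */
void frame_latch_consume(frame_latch *latch)
{
    latch->ready = false;
}

/* Simulation helper (illustration only). */
typedef struct {
    int signals; /* pending signals on the simulated object */
} sim_event;

bool sim_event_try_wait(void *user)
{
    sim_event *e = (sim_event *)user;
    if (e->signals == 0) return false;
    e->signals--;
    return true;
}
```

With one pending signal, polling twice returns true both times while only consuming the signal once; after frame_latch_consume, polling returns false until the object is signaled again.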

mmozeiko avatar Jul 17 '21 18:07 mmozeiko

It isn't that simple. It's not just a boolean WaitEvent like ManualResetEvent or AutoResetEvent. I don't know exactly how it works, or whether it's really a semaphore internally, but it depends on how many buffers your swap chain has.

On top of that, do note that the correct call order is

while (1)
{
   WaitForSingleObject
   -- render frame --
   Present
}

(i.e. the important bit is that it is necessary to call WaitForSingleObject before the very first Present call as well -- see here: https://docs.microsoft.com/en-us/windows/win32/api/dxgi1_3/nf-dxgi1_3-idxgiswapchain2-getframelatencywaitableobject )

This is wrong:

while (1)
{
   -- render frame --
   Present
   WaitForSingleObject
}

The bottom code will appear to work, but it will introduce one additional frame of input lag - the Present call will block while WaitForSingleObject won't block (this applies to all upcoming frames).

bplu4t2f avatar Jul 17 '21 19:07 bplu4t2f

Just out of personal interest I decided to check the object type of the "Frame Latency Waitable Object" using the undocumented NT kernel function described here: http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Type%20independed/NtQueryObject.html (note: The OBJECT_TYPE_INFORMATION struct on that site isn't correct, you'll have to add a bunch of random padding to get to 0x78 struct size, otherwise the function won't eat it)

It is in fact a semaphore. However, I was unable to query the semaphore count using NtQuerySemaphore because my process doesn't have access rights to query that semaphore and I wasn't quite sure where to even start looking to fix that, so whatever. Maybe it's possible to figure out with WinDBG, but that program causes me severe pain.

Not very relevant to the discussion in the end, but I enjoy absorbing information, maybe someone else was curious too.

bplu4t2f avatar Jul 20 '21 09:07 bplu4t2f

Hmm, I noticed the vsync rate-limited speed is in the ballpark of MS Terminal. Maybe that's their problem as well?

orangepizza avatar Nov 17 '21 08:11 orangepizza

Hmm, I noticed the vsync rate-limited speed is in the ballpark of MS Terminal. Maybe that's their problem as well?

@orangepizza No, that is not the case. Check the initial issue in the Windows Terminal repository, where I profiled their code. A huge amount of time is spent in other parts of the code before any rendering code runs.

mmozeiko avatar Nov 17 '21 19:11 mmozeiko