
2DGraphics vsync is slow

Open stephaneweg opened this issue 1 year ago • 13 comments

Hi,

I have a little issue, and I don't know what the cause is. In my kernel I use 2DGraphics; when I use the vsync option, it is very slow, around 1 FPS, even if I draw a single rectangle. The slow part seems to be the call to "m_pFrameBuffer->WaitForVerticalSync();".

When I do not use the vsync option, it does a memcpy, and it's faster.

What could be the cause of that?

best regards

stephaneweg avatar Oct 14 '24 19:10 stephaneweg

With which RPi model do you work and with which display? Have you tested this with sample/41-screenanimations?

rsta2 avatar Oct 15 '24 09:10 rsta2

I have tested sample 41 with the same result (I am using an RPi 4B and a standard HDMI display), but I think I solved my problem: in the config file, I have now disabled the option that loads the overlay (dtoverlay=vc4-kms-v3d).

It works better, but it is still stuck at 30 Hz.

stephaneweg avatar Oct 15 '24 12:10 stephaneweg

Then clearing the screen and drawing take too long to reach 60 Hz. You may have to reduce the screen size (width, height) if 60 Hz is important to you.

rsta2 avatar Oct 15 '24 17:10 rsta2
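Editor's sketch: the jump from 60 Hz straight down to 30 Hz follows from how waiting for vertical sync quantizes the frame rate. The helper below is a minimal illustration, assuming the application finishes a frame and then waits for the next vertical sync exactly once per frame. The function and parameter names are hypothetical and are not part of Circle.

```cpp
#include <cassert>
#include <cmath>

// Effective frame rate when every frame waits for the next vertical sync.
// refreshHz:   display refresh rate, e.g. 60.0
// frameTimeMs: time needed to clear, draw and copy one frame
// Both names are illustrative assumptions, not Circle identifiers.
double EffectiveFrameRate(double refreshHz, double frameTimeMs)
{
    assert(refreshHz > 0.0 && frameTimeMs > 0.0);

    // Duration of one refresh period, about 16.67 ms at 60 Hz.
    const double periodMs = 1000.0 / refreshHz;

    // A frame always occupies a whole number of refresh periods,
    // so a frame that takes 20 ms uses two periods.
    const double periodsPerFrame = std::ceil(frameTimeMs / periodMs);

    return refreshHz / periodsPerFrame;
}
```

With a 60 Hz display, a frame that takes 10 ms runs at 60 Hz, while a frame that takes 20 ms misses one sync and runs at 30 Hz. There is no intermediate rate, which is why reducing the amount of work per frame (for example with a smaller screen size) is the way back to 60 Hz.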

There is an update on the develop branch, which speeds up operation very much. The class C2DGraphics can be used on the Raspberry Pi 5 now too, but without vertical sync support.

rsta2 avatar Oct 28 '24 18:10 rsta2

It's great that it's sped up. I'm wondering if you can explain to me why drawing to a third buffer and using memcpy to update the second buffer before swapping is faster than just writing directly to the second in the first place. Clearly it is, I applied your fix to my copy and it works, but I don't understand why.

KyleCardoza avatar Nov 25 '24 16:11 KyleCardoza

The reason is the data cache. The buffer in which the drawing operations are done is in cached memory, while the frame buffer memory is not cached. Drawing is therefore much faster in cached memory, and memcpy() to uncached memory is relatively quick, because it uses strictly increasing addresses and word accesses.

rsta2 avatar Nov 25 '24 20:11 rsta2
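Editor's sketch: the pattern described above can be illustrated in portable C++. This is not the code of C2DGraphics; the class `DrawBuffer` and its methods are hypothetical, and the frame buffer is represented by a plain pointer. The point is only the access pattern: many scattered writes go to an ordinary (cached) buffer, and the frame buffer is touched once per frame with a single linear copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical draw buffer in ordinary (cached) memory; not Circle's C2DGraphics.
class DrawBuffer
{
public:
    DrawBuffer(unsigned width, unsigned height)
    :   m_width(width),
        m_height(height),
        m_pixels(static_cast<std::size_t>(width) * height, 0)
    {
    }

    // Fill a rectangle, clipped to the buffer. All writes hit cached memory.
    void FillRect(unsigned x, unsigned y, unsigned w, unsigned h, std::uint32_t color)
    {
        if (x >= m_width || y >= m_height)
            return;
        const unsigned xEnd = x + std::min(w, m_width - x);
        const unsigned yEnd = y + std::min(h, m_height - y);
        for (unsigned row = y; row < yEnd; ++row)
            for (unsigned col = x; col < xEnd; ++col)
                m_pixels[static_cast<std::size_t>(row) * m_width + col] = color;
    }

    // Copy the finished frame to the frame buffer in one pass with
    // strictly increasing addresses.
    void Present(std::uint32_t *frameBuffer) const
    {
        assert(frameBuffer != nullptr);
        std::memcpy(frameBuffer, m_pixels.data(), SizeInBytes());
    }

    std::size_t SizeInBytes() const { return m_pixels.size() * sizeof(std::uint32_t); }

private:
    unsigned m_width;
    unsigned m_height;
    std::vector<std::uint32_t> m_pixels;
};
```

A caller would issue any number of `FillRect()` calls per frame and then call `Present()` once with the address of the real frame buffer. The 32-bit pixel format here is an assumption made for the sketch.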

That makes sense, thank you. One more question if you don’t mind. Would DMA from the cached buffer to the framebuffer be faster than memcpy(), or is it about even?

KyleCardoza avatar Nov 26 '24 04:11 KyleCardoza

You are welcome. I haven't benchmarked this, but maybe. I'm working on a new general display interface for Circle. With this, the class C2DGraphics will use DMA to copy the internal display buffer to the frame buffer.

C2DGraphics will then work with logical colors (RGB888), so that it can be used on any display that supports the new CDisplay interface. This will require some small modifications in applications. The current status of this new display support is on the branch general-display-interface in the Circle repository. See #380 for more info.

rsta2 avatar Nov 26 '24 08:11 rsta2

Very cool. What I've done on my end is take the current main branch C2DGraphics class and extensively modify it into a project-specific class (two classes, actually: one that owns the framebuffer and one that deals with drawing to an arbitrary memory buffer) that supports clipping rectangles and alpha blending. I will try making my screen class use DMA to update the back buffer and see if that is better, worse, or the same.

KyleCardoza avatar Nov 26 '24 15:11 KyleCardoza

Okay, I have implemented DMA write for updating the back buffer from the cached draw buffer, and it's at least a little faster with a burst argument of zero; however, if I goose the burst argument up to 10, I get ~60 fps at 1080p. I'm not sure how big a burst argument it can handle without causing bus problems, though. 16 crashes it at boot. I would imagine that with more activity on all the CPU cores, bus contention with DMA gets worse?

KyleCardoza avatar Nov 26 '24 15:11 KyleCardoza

Great. Yes, the burst parameter has a big influence. How high this parameter can be set depends on what else is running on the bus. I wouldn't generally use values greater than 2, but of course you can tune this for your application. There is an "assert (nBurstLength <= 15)" in the DMA driver, so it cannot be greater than 15.

rsta2 avatar Nov 26 '24 17:11 rsta2
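Editor's sketch: one way to keep a tunable burst length within the limits mentioned above. The two constants come from this discussion (a hard limit of 15 from the driver's assert, and 2 as the general recommendation). The helper `ClampBurstLength` is hypothetical and not part of Circle; its result would be passed as the burst-length argument when setting up the DMA copy.

```cpp
#include <algorithm>
#include <cassert>

// Hard limit: the DMA driver asserts nBurstLength <= 15.
const unsigned kMaxBurstLength = 15;

// General recommendation from the discussion: do not exceed 2 by default.
const unsigned kDefaultBurstCeiling = 2;

// Hypothetical helper: clamp a requested burst length to an
// application-specific ceiling. The ceiling itself must respect the
// driver limit, so a bad configuration fails here, not inside the driver.
unsigned ClampBurstLength(unsigned requested, unsigned ceiling = kDefaultBurstCeiling)
{
    assert(ceiling <= kMaxBurstLength);
    return std::min(requested, ceiling);
}
```

An application that has measured its own bus load can raise the ceiling explicitly, for example `ClampBurstLength(10, 15)`, while everything else keeps the conservative default.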

I presently have it at 5, but I will back it off to 2; I just wanted to see the limits. I've implemented a limiter when running at 1080p, so it stays at a locked 30fps; 960x540, the default resolution I chose, runs 60fps even without burst.

KyleCardoza avatar Nov 26 '24 18:11 KyleCardoza
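Editor's sketch: the limiter mentioned above is not shown in the thread, so the following is only one possible shape of such a limiter, not the poster's implementation. It assumes the caller can read a monotonically increasing microsecond counter and can wait for a given number of microseconds. The class `FrameLimiter` and its method are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical frame limiter driven by a caller-supplied microsecond counter.
class FrameLimiter
{
public:
    explicit FrameLimiter(unsigned targetFps)
    :   m_frameUs(0), m_nextUs(0), m_started(false)
    {
        assert(targetFps > 0);
        m_frameUs = 1000000u / targetFps;   // 33333 us at 30 fps
    }

    // Call once per frame after drawing has finished, with the current time.
    // Returns how many microseconds to wait before presenting the frame.
    std::uint64_t WaitTimeUs(std::uint64_t nowUs)
    {
        if (!m_started || nowUs >= m_nextUs + m_frameUs)
        {
            // First frame, or more than one frame behind: resynchronize
            // so the limiter does not try to catch up with a burst of frames.
            m_started = true;
            m_nextUs = nowUs + m_frameUs;
            return 0;
        }

        // Wait until the deadline if we are early, then schedule the next one.
        const std::uint64_t wait = (nowUs < m_nextUs) ? m_nextUs - nowUs : 0;
        m_nextUs += m_frameUs;
        return wait;
    }

private:
    std::uint64_t m_frameUs;    // duration of one frame
    std::uint64_t m_nextUs;     // deadline of the next presentation
    bool m_started;
};
```

Deadlines advance by a fixed step, so small variations in drawing time do not accumulate into drift; only a frame that overruns by more than a full period resets the schedule.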

Good to know.

rsta2 avatar Nov 26 '24 22:11 rsta2