
Optimizing pl_mpeg

Open bitbank2 opened this issue 11 months ago • 41 comments

Hi Dominic, I recently got approved for a grant from NLnet and one of the projects I proposed was to optimize the pl_mpeg player to make it suitable for use on constrained devices (e.g. Microcontrollers). I've forked it and will be sharing my optimizations as I make them. Besides speed optimizations, I will make some #ifdef changes for the different memory challenges on MCUs (e.g. PSRAM).

bitbank2 avatar Jan 17 '25 16:01 bitbank2

Very cool! I can't promise to merge any of your changes (if that's even desired), but I'll certainly have a look.

For performance optimizations, have a look at the n64 port of pl_mpeg in libdragon: https://github.com/DragonMinded/libdragon/tree/preview/src/video – from what I understand they added a bunch of general optimizations for bit reading and VLC lookups (and of course some very n64 specific YUV conversion functions etc).

phoboslab avatar Jan 19 '25 19:01 phoboslab

I've actually gone beyond most of the optimizations in that repo. I still have a few more to go...

bitbank2 avatar Jan 20 '25 05:01 bitbank2

One of the strategies I need to implement for MCUs is knowing which MBs have changed each frame and only drawing those. A significant amount of time on MCUs is spent converting pixel colorspace and pushing those pixels to the display. Any MBs that can be avoided per frame helps a lot.
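
A minimal sketch of the idea, not code from the fork: flag each 16x16 macroblock whose luma differs from the previous frame, so that only flagged blocks need colorspace conversion and a push to the display. The names `mark_dirty_macroblocks` and `MB_SIZE` are hypothetical. A real decoder would more likely derive this from the macroblock type while decoding than by comparing pixels afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB_SIZE 16 /* MPEG-1 macroblocks cover 16x16 luma pixels */

/* Hypothetical helper: sets dirty[i] = 1 for every macroblock whose luma
 * differs between prev and cur, and returns the number of dirty blocks.
 * width and height must be multiples of 16; dirty must hold
 * (width / 16) * (height / 16) entries. */
static int mark_dirty_macroblocks(const uint8_t *prev, const uint8_t *cur,
                                  int width, int height, uint8_t *dirty)
{
    assert(width % MB_SIZE == 0 && height % MB_SIZE == 0);
    int mb_cols = width / MB_SIZE;
    int mb_rows = height / MB_SIZE;
    int count = 0;

    for (int my = 0; my < mb_rows; my++) {
        for (int mx = 0; mx < mb_cols; mx++) {
            int changed = 0;
            /* Compare the block row by row and stop at the first difference. */
            for (int row = 0; row < MB_SIZE && !changed; row++) {
                size_t off = (size_t)(my * MB_SIZE + row) * (size_t)width
                           + (size_t)mx * MB_SIZE;
                changed = memcmp(prev + off, cur + off, MB_SIZE) != 0;
            }
            dirty[my * mb_cols + mx] = (uint8_t)changed;
            count += changed;
        }
    }
    return count;
}
```

A display loop could then walk `dirty` and convert and transfer only the flagged blocks.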

bitbank2 avatar Jan 21 '25 10:01 bitbank2

@bitbank2 There's no issues page on your fork, so I'm messaging you here. I tried your optimized version and it's really impressive! But currently it's unusable for me because it randomly hangs at 100% CPU and stops decoding entirely. It happens with every mpg file I tested when seeking through it; for some files I don't even have to seek. I hope you can reproduce this behavior:

gcc pl_mpeg_player_sdl.c -o player -O3 -lSDL2
./player bjork-all-is-full-of-love.mpg

It happens with the example video in the readme too. When I click on the screen to seek through the video, it stops completely and hangs. This behavior also happens on PSVita, which is the platform I target with my app.

siteswapv4 avatar Jan 28 '25 15:01 siteswapv4

Thanks for letting me know - I'll figure it out.

bitbank2 avatar Jan 28 '25 16:01 bitbank2

@siteswapv4 I'm not able to reproduce your hang on the mac. I'm playing the bjork example video and seeking at the beginning to get the duration. I'm not familiar with other player programs. Can you give me the specific function to call with the specific parameter that makes it hang on the bjork video? I didn't add a seek function yet to my test player.

bitbank2 avatar Jan 28 '25 20:01 bitbank2

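#define PL_MPEG_IMPLEMENTATION
#include "pl_mpeg.h"

#include <stdio.h>

int main(int argc, char* argv[])
{
	plm_t* plm;
	plm_frame_t* frame;

	if (argc < 2)
	{
		fprintf(stderr, "Usage: %s <file.mpg>\n", argv[0]);
		return 1;
	}

	plm = plm_create_with_filename(argv[1]);
	if (!plm)
	{
		fprintf(stderr, "Could not open %s\n", argv[1]);
		return 1;
	}

	/* Decode one frame, then seek to second i with exact seeking enabled. */
	for (int i = 0; i < 100; i++)
	{
		printf("Loop : %d\n", i);
		frame = plm_decode_video(plm);
		(void)frame; /* only the decode call matters for this repro */

		plm_seek(plm, i, TRUE);
	}

	plm_destroy(plm);
	return 0;
}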

This hangs forever at loop 4 for me with the bjork example. I compile and run with:

gcc test.c -o test
./test bjork-all-is-full-of-love.mpg 

siteswapv4 avatar Jan 28 '25 22:01 siteswapv4

Maybe you could open your github repo for issues also so people can report bugs better later @bitbank2

Hope you can reproduce it this time

siteswapv4 avatar Jan 28 '25 22:01 siteswapv4

Also I'm testing this on linux but same behavior on psvita so I don't think it's relevant

siteswapv4 avatar Jan 28 '25 22:01 siteswapv4

I was able to reproduce the problem in my macOS project. It has to do with the changes I made to the buffer/file access, which gets messed up when seeking. I think I can resolve it. How much faster does it run in your tests? On the Mac it's about 230% faster than the original (so far).

bitbank2 avatar Jan 29 '25 01:01 bitbank2

I haven't been able to benchmark it yet because of the crash issue but on desktop my whole game went from 25% to 20% single core usage just with that. I'd have to run tests on the Vita where it's most relevant.

siteswapv4 avatar Jan 29 '25 06:01 siteswapv4

@bitbank2 Just tested your version on psvita (without seeking) and I can play videos with a bitrate 2 to 3 times higher than before (I disable audio and only play video), so yup the improvement is crazy lol

siteswapv4 avatar Jan 29 '25 11:01 siteswapv4

I fixed the problem, but discovered something in the design that I really don't like. Because of the way it seems to seek audio and video separately, it gums up the buffer such that it reallocates the file buffer to 2x size mid-decode. I will push my fix for you to try, but I would like to redesign the logic so it never calls realloc().
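
One way to avoid growing the buffer, shown only as a sketch under assumptions and not as the fork's design: use a fixed-capacity buffer that reclaims already-consumed bytes by compaction and accepts only as much new data as fits. The type `fixed_buf_t`, the functions, and the 4096-byte capacity are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIXED_BUF_CAPACITY 4096 /* assumed size; tune for the target's RAM */

typedef struct {
    uint8_t data[FIXED_BUF_CAPACITY];
    size_t read_pos; /* next unread byte */
    size_t length;   /* number of valid bytes in data */
} fixed_buf_t;

/* Move unread bytes to the front so the consumed space can be reused. */
static void fixed_buf_compact(fixed_buf_t *b)
{
    assert(b->read_pos <= b->length);
    memmove(b->data, b->data + b->read_pos, b->length - b->read_pos);
    b->length -= b->read_pos;
    b->read_pos = 0;
}

/* Append up to n bytes and return how many were accepted. Never allocates:
 * if the data does not fit even after compaction, the caller keeps the rest
 * and retries once the decoder has consumed more. */
static size_t fixed_buf_write(fixed_buf_t *b, const uint8_t *src, size_t n)
{
    if (b->length + n > FIXED_BUF_CAPACITY) {
        fixed_buf_compact(b);
    }
    size_t room = FIXED_BUF_CAPACITY - b->length;
    if (n > room) {
        n = room;
    }
    memcpy(b->data + b->length, src, n);
    b->length += n;
    return n;
}
```

The trade-off is that the file reader has to handle short writes, in exchange for a memory footprint that is known at compile time.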

bitbank2 avatar Jan 29 '25 15:01 bitbank2

@bitbank2 Completely fixed, both the seek issue and the random crashes I was getting, tested with the same files that were failing before. Thanks a lot !

siteswapv4 avatar Jan 29 '25 15:01 siteswapv4

Do you think you can push the optimizations further or are you about done ?

siteswapv4 avatar Jan 29 '25 16:01 siteswapv4

I'm at the point where the next set of optimizations would require breaking the code and then fixing the breakage. Most of the hot spots are already in good shape. One option I've tested is to have a macroblock callback function instead of a frame callback. This works on some videos, but not all, because the switch between showing a B frame or an F frame messes things up. It's a dramatic speedup, because redrawing the entire frame every time is wasteful when most pixels don't change. The problem is that dramatic movement will cause a worst-case scenario, so optimizing for small frame changes can lead to uneven playback load. Here's the main profile so far. The code taking the most time is doing the necessary macroblock decoding/copying/moving.

Image

bitbank2 avatar Jan 29 '25 19:01 bitbank2

I see! Well, the improvement is already incredible right now. I'll take a peek at your fork from time to time, but right now there's nothing more I can ask for on the PSVita: it now decodes up to native resolution (960x540) at 3000K video bitrate, thanks

siteswapv4 avatar Jan 29 '25 20:01 siteswapv4

There are plenty of places in the current code where some 128-bit SIMD would help significantly, but that's a little beyond the scope of what I'm doing. I wanted to maintain a pure C project that can compile on any target CPU (32-bits or better).

bitbank2 avatar Jan 29 '25 20:01 bitbank2

If SIMD is to be considered, the suggestions would be SSE2 for x86(_64) and NEON for ARM. Virtually every CPU launched in the past 20 years supports one of them, and you can always hide them behind preprocessor directives.

(I don't know much about implementing SIMD, just wanted to point that out)
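
As an illustration of hiding SIMD behind preprocessor directives, here is a sketch that computes the rounded average of two pixel rows, with SSE2 and NEON paths and a plain C fallback. The function name `average_pixels` is hypothetical and this is not code from pl_mpeg or the fork. It relies on the compiler defining `__SSE2__` or `__ARM_NEON`, which GCC and Clang do when those instruction sets are enabled; other compilers may need different checks.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = (a[i] + b[i] + 1) >> 1 for n bytes. */
static void average_pixels(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    int i = 0;
#if defined(__SSE2__)
    /* 16 pixels per iteration; _mm_avg_epu8 is a rounding average. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
#elif defined(__ARM_NEON)
    /* 16 pixels per iteration; vrhaddq_u8 is a rounding halving add. */
    for (; i + 16 <= n; i += 16) {
        vst1q_u8(dst + i, vrhaddq_u8(vld1q_u8(a + i), vld1q_u8(b + i)));
    }
#endif
    /* Portable C path: handles the tail, or everything when no SIMD path
     * was compiled in. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
}
```

Because the scalar loop always exists, the same source builds on targets with neither instruction set.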

DeeJayLSP avatar Jan 30 '25 02:01 DeeJayLSP

Yup same, if SIMD makes the thing faster I'm all in, but not at the cost of compatibility with most CPUs. I know multiple libs (like cglm for matrices) that use SIMD and compile fine as header-only on all the architectures I target, so I'm sure it's possible.

siteswapv4 avatar Jan 30 '25 09:01 siteswapv4

There's also "SIMD Everywhere". I've never actually used SIMD myself, I just checked the GitHub page, but it seems promising: https://github.com/simd-everywhere/simde

siteswapv4 avatar Jan 30 '25 09:01 siteswapv4

I can add SIMD for x86 and Arm NEON easily and in a way that doesn't break it for generic C targets. I had assumed that pl_mpeg was only used on "strange/odd" targets because x86/Arm would have optimized playback in some other form already. If you're saying that your use of pl_mpeg is on an x86 or Arm CPU with SIMD capability, then sure, I'm in!

bitbank2 avatar Jan 30 '25 10:01 bitbank2

PSVita has an ARM Cortex-A9, which supports NEON SIMD, so I'm all in! pl_mpeg is used on every platform possible from what I could see, from N64 to Switch, on desktop, etc. It's just so easy to put everywhere since there are no deps to manage.

siteswapv4 avatar Jan 30 '25 13:01 siteswapv4

I've already begun and the NEON code is "erasing" the time spent in the macroblock processing. >100ms turns into 8ms with SIMD: This could lead to another 20+% overall speedup. What code do you use for converting the YCbCr->RGB? I can add SIMD for that too.

Image
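
For context, the scalar form of the conversion being asked about can be sketched as below. It uses the standard BT.601 limited-range coefficients in 8.8 fixed point and assumes chroma is subsampled 2:1 horizontally, as in 4:2:0 video. The function names are hypothetical; this is not the conversion code from pl_mpeg or the fork.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: convert one row of YCbCr to packed RGB.
 * cb and cr hold one sample per two luma pixels; rgb must hold
 * 3 * width bytes. */
static void ycbcr_row_to_rgb(const uint8_t *y, const uint8_t *cb,
                             const uint8_t *cr, uint8_t *rgb, int width)
{
    assert(width >= 0);
    for (int x = 0; x < width; x++) {
        int c = 298 * (y[x] - 16) + 128; /* 1.164 * (Y - 16), plus rounding */
        int d = cb[x / 2] - 128;
        int e = cr[x / 2] - 128;
        /* Division instead of a shift keeps negative values well defined;
         * they are clamped to 0 either way. */
        rgb[3 * x + 0] = clamp_u8((c + 409 * e) / 256);           /* R */
        rgb[3 * x + 1] = clamp_u8((c - 100 * d - 208 * e) / 256); /* G */
        rgb[3 * x + 2] = clamp_u8((c + 516 * d) / 256);           /* B */
    }
}
```

Every output pixel is independent of its neighbours, which is why this kind of loop tends to map well onto SIMD.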

bitbank2 avatar Jan 30 '25 13:01 bitbank2

I'm not converting anything. I use SDL2 (now SDL3, but same behavior) and directly update a YUV texture. The SDL2-only example in the pl_mpeg GitHub repo is from me if you want to take a look.

siteswapv4 avatar Jan 30 '25 15:01 siteswapv4

wow great !

siteswapv4 avatar Jan 30 '25 15:01 siteswapv4

I did a test by compiling pl_mpeg_extract_frames before and after the optimizations (as of 2bb3b474b82f5ebc76ee5eb4717e16c422b7c887), using it on a 4k 60fps video, then re-encoding it again.

Before: https://0x0.st/s/SyCGgHGrTtYPHlMHMVSlFg/88wG.mp4

After: https://0x0.st/s/uJT7rguTFuNecxx2FDyvmA/88wD.mp4

Outside pl_mpeg_extract_frames or on less demanding videos (1080p or less, 24-30fps) the decoded result is a bit less broken. This can be observed since the first commit that implemented the fast VLC.

I get that the optimization process isn't finished yet, but I thought pointing this out could help in some way.

Despite this, the performance gains are quite impressive. Even Theora with its SSE2 can't compete.

DeeJayLSP avatar Jan 31 '25 03:01 DeeJayLSP

Thanks for the info. My initial changes reduced the frequency of checking the buffer state. The original code was checking for data available after each VLC decode. I need to implement a highwater mark check at the start of each macroblock decode to fix this.
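
The high-water mark idea can be sketched as follows: check once per macroblock that a worst-case macroblock is fully buffered, then read bits inside it without further checks. This is an illustration under assumptions, not the fork's code. `bit_reader_t`, both functions, and the 1024-byte bound are hypothetical, and the real worst-case macroblock size depends on the stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on the coded size of one macroblock, in bytes.
 * Illustrative only; the real bound depends on the stream. */
#define MB_HIGH_WATER_BYTES 1024

typedef struct {
    const uint8_t *data;
    size_t length;  /* valid bytes in data */
    size_t bit_pos; /* read position, in bits */
} bit_reader_t;

/* One check per macroblock: is a worst-case macroblock fully buffered? */
static int has_macroblock_headroom(const bit_reader_t *r)
{
    size_t byte_pos = (r->bit_pos + 7) >> 3;
    return byte_pos <= r->length
        && r->length - byte_pos >= MB_HIGH_WATER_BYTES;
}

/* Unchecked read of 1..16 bits, most significant bit first. Only safe
 * after has_macroblock_headroom() returned nonzero for this macroblock. */
static unsigned read_bits_unchecked(bit_reader_t *r, int count)
{
    assert(count >= 1 && count <= 16);
    unsigned value = 0;
    while (count--) {
        unsigned bit = (r->data[r->bit_pos >> 3] >> (7 - (r->bit_pos & 7))) & 1u;
        value = (value << 1) | bit;
        r->bit_pos++;
    }
    return value;
}
```

When the check fails, the decoder would refill the buffer first or, near the end of the stream, fall back to a fully checked read path.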

bitbank2 avatar Jan 31 '25 08:01 bitbank2

@siteswapv4 I just pushed a change which adds some NEON SIMD, and the overall video decode speedup is about 21% compared to the previous build. I will now work on fixing the stream issues caused by removing some of the "has enough" checks.

bitbank2 avatar Jan 31 '25 11:01 bitbank2

It doesn't compile on PSVita right now because

expected 'uint16x8_t' but argument is of type 'uint8x16_t'
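
This kind of error is typical of GCC, which does not implicitly convert between NEON vector types with different element widths unless `-flax-vector-conversions` is passed. The thread does not show the offending line, so the following is only a sketch of the usual fix, an explicit `vreinterpretq_u16_u8`, in a hypothetical helper. If the intent were to widen values instead of reinterpreting bits, a widening intrinsic such as `vmovl_u8` would be needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy 16 bytes into eight 16-bit lanes, bit for bit. */
static void store_bytes_as_u16(uint16_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__ARM_NEON)
    uint8x16_t bytes = vld1q_u8(src);
    /* vst1q_u16 expects a uint16x8_t, so passing `bytes` directly is the
     * kind of call GCC rejects. The reinterpret changes only the type,
     * not the bits. */
    vst1q_u16(dst, vreinterpretq_u16_u8(bytes));
#else
    memcpy(dst, src, 16); /* portable path with the same result */
#endif
}
```

Explicit reinterprets produce no extra instructions and compile with both strict and lax compilers.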

siteswapv4 avatar Jan 31 '25 12:01 siteswapv4