Optimizing pl_mpeg
Hi Dominic, I recently got approved for a grant from NLnet, and one of the projects I proposed was to optimize the pl_mpeg player to make it suitable for use on constrained devices (e.g. microcontrollers). I've forked it and will be sharing my optimizations as I make them. Besides speed optimizations, I will add some #ifdef options to handle the different memory constraints on MCUs (e.g. PSRAM).
Very cool! I can't promise to merge any of your changes (if that's even desired), but I'll certainly have a look.
For performance optimizations, have a look at the n64 port of pl_mpeg in libdragon: https://github.com/DragonMinded/libdragon/tree/preview/src/video – from what I understand they added a bunch of general optimizations for bit reading and VLC lookups (and of course some very n64 specific YUV conversion functions etc).
I've actually gone beyond most of the optimizations in that repo. I still have a few more to go...
One of the strategies I need to implement for MCUs is knowing which MBs have changed each frame and only drawing those. A significant amount of time on MCUs is spent converting pixel colorspace and pushing those pixels to the display. Any MBs that can be avoided per frame helps a lot.
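As a rough illustration of the "only draw changed MBs" idea, here is a minimal sketch that flags which 16x16 macroblocks differ between two luma planes. It is not code from the fork: the function name `mark_changed_macroblocks` and the `dirty` map are hypothetical, and the width and height are assumed to be multiples of 16. A decoder could presumably flag skipped macroblocks directly while decoding instead of comparing pixels afterwards; the comparison here is only a simple stand-in.

```c
#include <stdint.h>
#include <string.h>

#define MB_SIZE 16

/* Hypothetical helper: sets dirty[i] to 1 for every 16x16 macroblock whose
 * luma differs between `cur` and `prev`, 0 otherwise, and returns the number
 * of changed macroblocks. `dirty` must hold (width/16) * (height/16) bytes.
 * Assumes width and height are multiples of 16 and stride == width. */
static int mark_changed_macroblocks(const uint8_t *cur, const uint8_t *prev,
                                    int width, int height, uint8_t *dirty)
{
    int mb_cols = width / MB_SIZE;
    int mb_rows = height / MB_SIZE;
    int changed = 0;

    for (int my = 0; my < mb_rows; my++) {
        for (int mx = 0; mx < mb_cols; mx++) {
            size_t offset = (size_t)my * MB_SIZE * width + (size_t)mx * MB_SIZE;
            uint8_t is_dirty = 0;

            /* Compare the macroblock one 16-pixel row at a time and stop
             * at the first row that differs. */
            for (int row = 0; row < MB_SIZE; row++) {
                size_t row_offset = offset + (size_t)row * width;
                if (memcmp(cur + row_offset, prev + row_offset, MB_SIZE) != 0) {
                    is_dirty = 1;
                    break;
                }
            }
            dirty[my * mb_cols + mx] = is_dirty;
            changed += is_dirty;
        }
    }
    return changed;
}
```

The display code would then convert and push only the macroblocks whose `dirty` entry is set.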
@bitbank2 There's no issues page on your fork, so I'm messaging you here. I tried your optimized version and it's really impressive! But currently it's unusable for me because it randomly hangs and takes 100% of the CPU, stopping decoding entirely. It happens with every mpg file I tested when seeking through it; for some files I don't even have to seek. I hope you can replicate this behavior:
gcc pl_mpeg_player_sdl.c -o player -O3 -lSDL2
./player bjork-all-is-full-of-love.mpg
It happens with the example video in the readme too. When I click on the screen to seek through the video, it stops completely and hangs. This behavior also happens on PSVita, which is the platform I target with my app.
Thanks for letting me know - I'll figure it out.
@siteswapv4 I'm not able to reproduce your hang on the mac. I'm playing the bjork example video and seeking at the beginning to get the duration. I'm not familiar with other player programs. Can you give me the specific function to call with the specific parameter that makes it hang on the bjork video? I didn't add a seek function yet to my test player.
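    #define PL_MPEG_IMPLEMENTATION
    #include "pl_mpeg.h"
    #include <stdio.h>

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "Usage: %s <file.mpg>\n", argv[0]);
            return 1;
        }
        plm_t* plm = plm_create_with_filename(argv[1]);
        if (!plm) {
            fprintf(stderr, "Could not open %s\n", argv[1]);
            return 1;
        }
        /* Decode one frame, then do an exact seek to second i. */
        for (int i = 0; i < 100; i++)
        {
            printf("Loop : %d\n", i);
            plm_frame_t* frame = plm_decode_video(plm);
            (void)frame; /* the frame itself is not needed for the repro */
            plm_seek(plm, i, TRUE);
        }
        plm_destroy(plm);
        return 0;
    }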
This hangs forever at loop 4 for me when playing the bjork example. I compile and run with:
gcc test.c -o test
./test bjork-all-is-full-of-love.mpg
Maybe you could open your github repo for issues also so people can report bugs better later @bitbank2
Hope you can reproduce it this time
Also, I'm testing this on Linux, but I get the same behavior on PSVita, so I don't think the platform is relevant.
I was able to reproduce the problem in my macOS project. It has to do with the changes I made to the buffer/file access. It gets messed up when seeking. I think I can resolve it. How much faster does it run in your tests? On the Mac it's about 230% faster than the original (so far).
I haven't been able to benchmark it yet because of the crash issue but on desktop my whole game went from 25% to 20% single core usage just with that. I'd have to run tests on the Vita where it's most relevant.
@bitbank2 Just tested your version on psvita (without seeking) and I can play videos with a bitrate 2 to 3 times higher than before (I disable audio and only play video), so yup the improvement is crazy lol
I fixed the problem, but discovered something in the design that I really don't like. Because of the way it seems to seek audio and video separately, it gums up the buffer such that it reallocates the file buffer to 2x size mid-decode. I will push my fix for you to try, but I would like to redesign the logic to not ever use memrealloc().
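One way a buffer can avoid growing mid-decode is to keep a fixed capacity, slide the unread bytes to the front before each refill, and accept only as much new data as fits. The sketch below shows that general technique under stated assumptions; it is not the redesign planned for the fork, and `fixed_buf_t` and `fixed_buf_write` are hypothetical names, not part of pl_mpeg.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-capacity byte buffer; storage is supplied by the caller
 * and is never reallocated. */
typedef struct {
    uint8_t *bytes;    /* caller-provided storage */
    size_t capacity;   /* total size of `bytes` */
    size_t read_pos;   /* index of the next unread byte */
    size_t length;     /* number of valid bytes in `bytes` */
} fixed_buf_t;

/* Appends up to `n` bytes from `src`. Bytes that were already consumed are
 * discarded first by moving the unread tail to the front. Returns how many
 * bytes were actually accepted, which may be fewer than `n` (or zero) when
 * the buffer is full; the caller retries after more data has been read. */
static size_t fixed_buf_write(fixed_buf_t *b, const uint8_t *src, size_t n)
{
    if (b->read_pos > 0) {
        size_t unread = b->length - b->read_pos;
        memmove(b->bytes, b->bytes + b->read_pos, unread);
        b->length = unread;
        b->read_pos = 0;
    }

    size_t space = b->capacity - b->length;
    if (n > space) {
        n = space;
    }
    memcpy(b->bytes + b->length, src, n);
    b->length += n;
    return n;
}
```

The trade-off is that the producer has to cope with short writes, which is why seeking audio and video separately through one shared buffer would need its own handling.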
@bitbank2 Completely fixed, both the seek issue and the random crashes I was getting. I tested with the same files that were failing before. Thanks a lot!
Do you think you can push the optimizations further, or are you about done?
I'm at the point where the next set of optimizations would require breaking the code and then fixing the breakage. Most of the hot spots are already in good shape. One option I've tested is to have a macroblock callback function instead of a frame callback. This works on some videos, but not all, because the switch between showing a B frame and a P frame messes things up. It's a dramatic speedup because redrawing the entire frame every time is wasteful when most pixels don't change. The problem is that dramatic movement causes a worst-case scenario, so optimizing for small frame-to-frame changes can lead to an uneven playback load. Here's the main profile so far. The code taking the most time is doing the necessary macroblock decoding/copying/moving.
I see! Well, the improvement is already incredible right now. I'll take a peek at your fork from time to time, but right now there's nothing more I can ask for on the PSVita: it now decodes video at up to native resolution (960x540) with a 3000K bitrate. Thanks!
There are plenty of places in the current code where some 128-bit SIMD would help significantly, but that's a little beyond the scope of what I'm doing. I wanted to maintain a pure C project that can compile on any target CPU (32-bits or better).
If SIMD is to be considered, my suggestions would be SSE2 for x86(_64) and NEON for ARM. Virtually every CPU launched in the past 20 years supports them. And you can always hide them behind preprocessor directives.
(I don't know much about implementing SIMD, just wanted to point that out)
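For readers wondering what "hiding SIMD behind preprocessor directives" can look like, here is a minimal sketch with SSE2, NEON and plain C paths for one small operation: the rounded average of two 16-pixel rows, which is the kind of arithmetic used for half-pixel motion compensation. The function name `avg_row16` and the `SKETCH_USE_*` macros are hypothetical and not taken from pl_mpeg or the fork. The intrinsics used (`_mm_avg_epu8`, `vrhaddq_u8`) both compute `(a + b + 1) >> 1` per byte.

```c
#include <stdint.h>

/* Pick one implementation at compile time; generic C is the fallback. */
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define SKETCH_USE_SSE2 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
  #define SKETCH_USE_NEON 1
#endif

/* Hypothetical helper: dst[i] = (a[i] + b[i] + 1) >> 1 for 16 bytes. */
static void avg_row16(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
#if defined(SKETCH_USE_SSE2)
    /* Unaligned loads/stores so callers need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_avg_epu8(va, vb));
#elif defined(SKETCH_USE_NEON)
    /* Rounding halving add on 16 unsigned bytes. */
    vst1q_u8(dst, vrhaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#else
    /* Portable path for any C compiler and any CPU. */
    for (int i = 0; i < 16; i++) {
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
#endif
}
```

All three paths produce the same bytes, so targets without SIMD still build and behave identically, just more slowly.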
Yup, same. If SIMD makes things faster I'm all in, but not at the cost of compatibility with most CPUs. I know multiple libs (like cglm for matrices) that use SIMD and compile fine as header-only on all the architectures I target, so I'm sure it's possible.
There's also "SIMD Everywhere". I've never actually used SIMD myself and only checked the GitHub page, but it seems promising: https://github.com/simd-everywhere/simde
I can add SIMD for x86 and Arm NEON easily and in a way that doesn't break it for generic C targets. I had assumed that pl_mpeg was only used on "strange/odd" targets because x86/Arm would have optimized playback in some other form already. If you're saying that your use of pl_mpeg is on an x86 or Arm CPU with SIMD capability, then sure, I'm in!
The PSVita has an ARM Cortex-A9, which supports NEON SIMD, so I'm all in! pl_mpeg is used on every platform possible from what I could see, from N64 to Switch, on desktop, etc. It's just so easy to drop in anywhere because there are no deps to manage.
I've already begun and the NEON code is "erasing" the time spent in the macroblock processing. >100ms turns into 8ms with SIMD. This could lead to another 20+% overall speedup. What code do you use for converting YCbCr->RGB? I can add SIMD for that too.
I'm not converting anything. I use SDL2 (now SDL3, but same behavior) and directly update a YUV texture. The SDL2-only example in the pl_mpeg GitHub repo is from me, if you want to take a look.
wow great !
I did a test by compiling pl_mpeg_extract_frames before and after the optimizations (as of 2bb3b474b82f5ebc76ee5eb4717e16c422b7c887), using it on a 4k 60fps video, then re-encoding it again.
Before: https://0x0.st/s/SyCGgHGrTtYPHlMHMVSlFg/88wG.mp4
After: https://0x0.st/s/uJT7rguTFuNecxx2FDyvmA/88wD.mp4
Outside pl_mpeg_extract_frames or on less demanding videos (1080p or less, 24-30fps) the decoded result is a bit less broken. This can be observed since the first commit that implemented the fast VLC.
I get that the optimization process isn't finished yet, but I thought pointing this out could help in some way.
Despite this, the performance gains are quite impressive. Even Theora with its SSE2 can't compete.
Thanks for the info. My initial changes reduced how often the buffer state is checked. The original code checked for available data after each VLC decode. To fix this, I need to implement a high-water mark check at the start of each macroblock decode.
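A minimal sketch of what a per-macroblock high-water mark check could look like: instead of testing for available data after every VLC read, the decoder checks once per macroblock that at least a worst-case number of bytes is buffered, and only then uses unchecked reads. Everything here is an assumption for illustration: `bitbuf_t`, the function names, and especially `MB_HIGH_WATER_BYTES`, whose value is a placeholder rather than a measured bound for MPEG-1 macroblocks.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder bound on the bytes one macroblock may consume (assumption,
 * not derived from the MPEG-1 spec or from the fork). */
#define MB_HIGH_WATER_BYTES 1024

/* Hypothetical bit reader state. */
typedef struct {
    const uint8_t *data;  /* buffered stream bytes */
    size_t length;        /* number of valid bytes in `data` */
    size_t bit_pos;       /* position of the next bit to read */
} bitbuf_t;

/* Bytes remaining from the current read position to the end of the buffer. */
static size_t bitbuf_bytes_left(const bitbuf_t *b)
{
    size_t byte_pos = b->bit_pos >> 3;
    return byte_pos < b->length ? b->length - byte_pos : 0;
}

/* Called once at the start of each macroblock. Returns 1 when enough data is
 * buffered to decode the macroblock with unchecked reads, 0 when the caller
 * should refill the buffer (or fall back to a checked path) first. */
static int macroblock_can_decode_fast(const bitbuf_t *b)
{
    return bitbuf_bytes_left(b) >= MB_HIGH_WATER_BYTES;
}

/* Reads `count` bits (MSB first) with no bounds check. Only safe after
 * macroblock_can_decode_fast() returned 1 for the current macroblock. */
static unsigned bitbuf_read_unchecked(bitbuf_t *b, int count)
{
    unsigned value = 0;
    while (count-- > 0) {
        unsigned byte = b->data[b->bit_pos >> 3];
        unsigned bit = (byte >> (7 - (b->bit_pos & 7))) & 1u;
        value = (value << 1) | bit;
        b->bit_pos++;
    }
    return value;
}
```

The design trades many small checks for one coarse check, so the bound must be conservative: if it is too small, an unchecked read can run past the buffered data, which matches the kind of broken output described above.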
@siteswapv4 I just pushed a change which adds some NEON SIMD, and the overall video decode speedup is about 21% compared to the previous build. I will now work on fixing the stream issues caused by removing some of the "has enough" checks.
It doesn't compile on PSVita right now because
expected 'uint16x8_t' but argument is of type 'uint8x16_t'