retro-go icon indicating copy to clipboard operation
retro-go copied to clipboard

ESP32-P4 port

Open fcipaq opened this issue 3 months ago • 56 comments

Hi,

first of all: thank you very much for you great work and the excellent porting guide!

I'm currently porting Retro-Go to the ESP32-P4. I'm using my open-source DIY ESP32-P4 handheld (if you're interested you can find an introduction here: https://www.youtube.com/watch?v=5BC4VqG1UPU). Now, the RG graphics hardware abstraction is focuses on transmitting a buffer to an LCD with GRAM and I can see why. However, my handheld has a MIPI-Display that relies on timing-sensitive buffer transmission so I had to make some changes to rg_display.c Further changes may be necessary in the future to increase emulation speed (I already ported Snex9x only from the retro-go repo and it runs many games at full speed).

Now, here's my question: Is there general interest in merging these changes? (My repo is currently private, but I'm more than happy to grant access if you'd like to see what I modified).

fcipaq avatar Sep 06 '25 20:09 fcipaq

Someone shared your video in another issue/PR related to the P4, happy to see you here!

As far as adding a new display driver and new device target, I typically accepts anything as long as: the drivers be self contained as much as possible, and the device be available for purchase or that build instructions be available.

But merging significant changes to core retro-go files we'd likely have to discuss it and iterate on it in a PR. Alternatively if you don't want this long process you can just show me your code privately so I can better understand your needs and I'm sure I'd be able to make some tweaks to at least make your life easier! Things like adding new hooks or ways to bypass part of the rg_display code or inject code from the driver file should be no problem at all.

Or maybe you just need first-class esp_lcd support in retro-go? Because I have a stash somewhere in my local tree doing just that. It was never merged because, iirc, it didn't work well (at least with SPI). It frequently crashed due to bus conflicts with the SD card and it also didn't have an API to set low level things like gamma or voltage curves or at least issue the raw commands so it looked like crap. Also I think the draw/send function worked by copy so it created twice the memory transfers. But it's been a while, maybe it improved?

And finally, any improvement to snes9x or other emulators I would likely accept without any question. :)

ducalex avatar Sep 09 '25 22:09 ducalex

Echo the affirmation of general interest in this, I've been enjoying watching retro-go undergo its current renaissance as people contribute to helping @ducalex generalise out its functionality to more hardware/use cases.

HeathenUK avatar Sep 10 '25 07:09 HeathenUK

Someone shared your video in another issue/PR related to the P4, happy to see you here!

I had no idea there was going on so much with the P4, great to hear that! Next to the things you've already discussed, I also put some of the hot paths in IRAM, especially the CPU Ops. Moreover, GFX.Screen, GFX.SubScreen, GFX.ZBuffer and GFX.SubZBuffer need to be in IRAM or that'll hurt the performance big time. I ran out of RAM so I reduced the tile cache to have more RAM available and also GFX.SubScreen = GFX.Screen and to my surprise all games I tried didn't show any glitches but my understanding of Snes9x internals is too little to really get all the implications. Things done by the 2nd core won't eat into CPU0's performance, however, it will use some bandwidth (especially true for PSRAM) and increase the probability of cache misses.

As far as adding a new display driver and new device target, I typically accepts anything as long as: the drivers be self contained as much as possible, and the device be available for purchase or that build instructions be available.

Okay, I'm planning on fully open-sourcing it so that everyone can build one. :) Regarding the containment of the drivers: Currently, the app's performance is significantly impacted by the creation of an "rg_surface" buffer, which is then processed by rg_display and eventually submitted to the display. And that is a formidable way of doing it when you have an SPI/MCU attached. However, with a DSI or RGB display connected, the preferred approach is probably for the app to create two surfaces for double buffering. Now, one could modify the apps using a double-buffering approach only when using a DSI/RGB display. In Snes9x, I see you already had something like that in mind with static rg_surface_t *updates[2]. Alternatively, the apps could query the surface's data buffer pointer after submission to access the other framebuffer like GFX.Screen = rg_surface_get_buffer(my_surface), effectively flipping the buffer. It would probably only be a couple lines of code that would need to be changed. I'll check that out.

But merging significant changes to core retro-go files we'd likely have to discuss it and iterate on it in a PR. Alternatively if you don't want this long process you can just show me your code privately so I can better understand your needs and I'm sure I'd be able to make some tweaks to at least make your life easier! Things like adding new hooks or ways to bypass part of the rg_display code or inject code from the driver file should be no problem at all.

Btw.: The code is already public (apparently you cannot have a private repo that was forked from a public one)

Or maybe you just need first-class esp_lcd support in retro-go? Because I have a stash somewhere in my local tree doing just that. It was never merged because, iirc, it didn't work well (at least with SPI). It frequently crashed due to bus conflicts with the SD card and it also didn't have an API to set low level things like gamma or voltage curves or at least issue the raw commands so it looked like crap. Also I think the draw/send function worked by copy so it created twice the memory transfers. But it's been a while, maybe it improved?

You are right about that. The original driver copies everything upon writing, which is very slow. I modified it so that any given buffer can be scanned out. The supplied buffer does not have to match the screen resolution. So if you're, say, using a 720x720 screen (like I do) you can still have a 320x240 buffer submitted. You can then have the image integer scaled and rotated. However, this is not ideal either since the CPU does the heavy lifting and it takes up quite some memory bandwidth as well. Maybe this is a good job for the ESP32-P4's "2D graphics accelerator" - had no time to look into it.

And finally, any improvement to snes9x or other emulators I would likely accept without any question. :)

Great! It's a great project. You and the community have really achieved something! And I'd be more than happy to contribute if I can. Question is, what would be the best way to collaborate? If you - or someone else - wants to help port this to the "Picoheld 2" - I have a spare prototype lying around that waits to be tinkered with :) - it only has a plastic shell, but I'd be happy to give it away (for free :) ).

fcipaq avatar Sep 10 '25 19:09 fcipaq

Hey, I saw you used the master branch but the most recent changes are on the dev branch, make sure to check it out ;) Also a few of the fixes for the P4 have been done in here (like the bootloader offset, the 4bit SDMMC with the on chip LDO, the L/R/X/Y support on SNES9X, and the general speed improvements on SNES9X)

I saw you were using a pretty slow SDMMC speed and I had the same issue before using the on chip LDO (check the rg_storage.c esp32 p4 specifics) but I was personnaly using a devkit so maybe it won't change anything on your console.

Edit: I was also wondering what were the benefit of using such a high res MIPI display instead of the regular 320*240 over SPI ?

rapha-tech avatar Sep 10 '25 20:09 rapha-tech

Hey, yeah, you're right - the dev branch would probably have saved me some time... :) Maybe I'll find some time this weekend to check it out. Thanks for the info. And it's great to see that you even got GBA running on the ESP32-P4, great job!

Regarding the SD speed: you're right. However, I'm using an SOM (WT0132P4-A1), because it was the only option available when I started the project last year. And the datasheet is pretty vague, to say the least, so I'm not sure which, if any, of the LDOs is tied to SD_VREF.

For me, picking the hi-res screen, is all about aesthetics. It gives you a super crisp and blocky appearance, and for me, personally, I find that really appealing. I tried a 320x320 square MCU/SPI screen, but I found it looks too much like a fly screen. Just my two cents.

Do you have any further plans for the GBA emulator?

Thanks again for your suggestions!

Edit: Apparentely they just followed the reference design and SD_VREF is tied to LDO 4. Working with 40 MHz now. Thanks for the input.

fcipaq avatar Sep 11 '25 17:09 fcipaq

Great!

Thanks for the details about the display! Also, I guess you could have some nicer UI if you got the time to adapt it for your screen...

For the GBA emulator it was actually an older branch made by ducalex, I simply merged it into the dev branch for testing with the P4.

I'm not a very good programmer when it come to optimizing emulators this is mostly ducalex's work ;)

rapha-tech avatar Sep 13 '25 08:09 rapha-tech

Moreover, GFX.Screen, GFX.SubScreen, GFX.ZBuffer and GFX.SubZBuffer need to be in IRAM or that'll hurt the performance big time.

Just to be clear with IRAM you mean Internal RAM and not Instruction RAM here? I'm asking because although espressif themselves sometimes call internal ram IRAM, IRAM usually refers to the instruction RAM which, on xtensa esp32s at least, can only be accessed in 32bit words so that won't work for those 8/16 bit buffers.

Anything that isn't explicitly allocated in internal RAM already is either because it didn't fit (most likely) or I rely on CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL to make it happen implicitly. Reason is simply that I try to avoid rg_* calls in the apps themselves to keep them more "portable". Obviously if you find that calling rg_alloc(..., MEM_FAST) explicitly from the app makes a difference I will be pragmatic about it :).

In Snes9x, I see you already had something like that in mind with static rg_surface_t *updates[2]

About half my apps use double buffering. When they don't it's always because:

  • Emulation is so slow that we have plenty of time to scale and send it over SPI before the app gets to overwrite it (due to frameskip or just not reaching 100% speed anyhow), so it doesn't actually help avoid artifacts or improve speed, it just wastes RAM.
  • The second buffer doesn't fit in internal RAM and I found that having a framebuffer in SPIRAM is, in many cases, worse than having a single framebuffer.

In the dev branch there is a define to enable double buffering and GFX.Screen is flipped every time a frame is sent to the display.

From what I'm understanding, for your device to work well you just need a way to ditch the display_task mechanism entirely and have rg_display_submit let your driver do whatever with the raw framebuffer? Because all apps with double buffering already flip the buffer after calling rg_display_submit, so I don't think any modification to the apps is required. Although many apps use paletted framebuffers (nes, sms, pce, doom, genesis) so your driver would have to handle that (it's currently handled by display_task/write_update). I'm not opposed to adding 565 renderers directly to those apps either (behind a flag, because the og esp32 currently benefits more from the smaller framebuffers than it would from a simpler blitting pipeline).

ducalex avatar Sep 13 '25 20:09 ducalex

it also didn't have an API to set low level things like gamma or voltage curves or at least issue the raw commands so it looked like crap

Also to anyone reading my earlier comment I've found that I am wrong about that, there is esp_lcd_panel_io_tx_param. But the rest of my issues with esp_lcd still seem to apply as of esp-idf 5.4 unfortunately.

ducalex avatar Sep 13 '25 20:09 ducalex

Just to be clear with IRAM you mean Internal RAM and not Instruction RAM here? I'm asking because although espressif themselves sometimes call internal ram IRAM, IRAM usually refers to the instruction RAM which, on xtensa esp32s at least, can only be accessed in 32bit words so that won't work for those 8/16 bit buffers.

Yeah, sorry, my bad. I meant internal RAM as in MALLOC_CAP_INTERNAL

Anything that isn't explicitly allocated in internal RAM already is either because it didn't fit (most likely) or I rely on CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL to make it happen implicitly. Reason is simply that I try to avoid rg_* calls in the apps themselves to keep them more "portable". Obviously if you find that calling rg_alloc(..., MEM_FAST) explicitly from the app makes a difference I will be pragmatic about it :).

The emulated RAM is usually relatively slow whereas the emulator's RAM must be fast. When relying solely on malloc, some memory may be allocated in internal RAM, even though it doesn't need to be fast. This takes up space that could be used for faster memory. For example, in memmap.c Memory.RAM is allocated at the beginning, potentially taking up all the space in internal RAM and IPPU.ScreenColors, which might be frequently accessed, won't fit into internal RAM and will be allocated in PSRAM. Therefore I would prefer to explicitly control what memory is allocated where. Moreover, the default value for CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL is 32 kB, with anything larger than that will be allocated in PSRAM. For example rg_surface_create(SNES_WIDTH, SNES_HEIGHT_EXTENDED, RG_PIXEL_565_LE, 0); would always be allocated in external memory.

In Snes9x, I see you already had something like that in mind with static rg_surface_t *updates[2]

About half my apps use double buffering. When they don't it's always because:

* Emulation is so slow that we have plenty of time to scale and send it over SPI before the app gets to overwrite it (due to frameskip or just not reaching 100% speed anyhow), so it doesn't actually help avoid artifacts or improve speed, it just wastes RAM.

* The second buffer doesn't fit in internal RAM and I found that having a framebuffer in SPIRAM is, in many cases, worse than having a single framebuffer.

In the dev branch there is a define to enable double buffering and GFX.Screen is flipped every time a frame is sent to the display.

From what I'm understanding, for your device to work well you just need a way to ditch the display_task mechanism entirely and have rg_display_submit let your driver do whatever with the raw framebuffer? Because all apps with double buffering already flip the buffer after calling rg_display_submit, so I don't think any modification to the apps is required. Although many apps use paletted framebuffers (nes, sms, pce, doom, genesis) so your driver would have to handle that (it's currently handled by display_task/write_update). I'm not opposed to adding 565 renderers directly to those apps either (behind a flag, because the og esp32 currently benefits more from the smaller framebuffers than it would from a simpler blitting pipeline).

What I'm trying is to implement a rg_display_submit_direct next to the "regular" rg_display_submit allowing a direct write to the framebuffer in SNES. I think the other emulators are less critical and may use the "regular" way as does the menu. There are still some minor issues when switching between the two modes. But it generally works. I now get 50 fps out of Super Mario World's intro with every frame being rendered, which is good I guess. To do this I had to:

  • Put two screen buffers in internal memory (and that means Subscreen = Screen and yes, that is a hack gulp)
  • Put the ZBuffer in RAM
  • Disable tile buffering (which does no good due to the slow external RAM)
  • Put the CPU Ops along some other stuff in IRAM (This time, I really mean instruction memory :) )

If you're interested you can take a look in the repo (dev branch), but it's a WIP :)

fcipaq avatar Sep 14 '25 16:09 fcipaq

Oh, and it looks like there's an issue with the audio sampling rate. The audio task's queuing limits the main thread to 50 FPS (xQueueSend(task->queue, msg, portMAX_DELAY) == pdTRUE;) The target video FPS rate, however, seems to be 60. Consequently, the video can never catch up with the set rate, which triggers massive frame dropping.

fcipaq avatar Sep 14 '25 21:09 fcipaq

It's been so long that I can't really explain my allocation decisions, though I know that I have spent hours trying every single allocation being internal or external or bss, so there must have been tradeoffs to explain the current code.

I will be trying all the changes you've suggested. Annoyingly the first step will have to be to remove snes9x from retro-core, because IRAM is full there. But if it helps performance then so be it.

For example rg_surface_create(SNES_WIDTH, SNES_HEIGHT_EXTENDED, RG_PIXEL_565_LE, 0); would always be allocated in external memory.

The last argument of rg_surface_create is an rg_alloc argument, if you put MEM_FAST instead of 0 it will try to allocating in internal RAM.

Disable tile buffering (which does no good due to the slow external RAM)

I'm surprised by this. I've actually worked a lot on the tile buffering (I think the og snes9x uses 10x the memory) because disabling it entirely was very very slow on the esp32. I guess that's the difference when you have a faster CPU!

ducalex avatar Sep 15 '25 19:09 ducalex

Oh, and it looks like there's an issue with the audio sampling rate. The audio task's queuing limits the main thread to 50 FPS (xQueueSend(task->queue, msg, portMAX_DELAY) == pdTRUE;) The target video FPS rate, however, seems to be 60. Consequently, the video can never catch up with the set rate, which triggers massive frame dropping.

Is it because I mix the entire AUDIO_BUFFER_LENGTH every frame? Which is set to accomodate 50 fps, so the audio code currently always renders a bit more than 1 frame when playing NTSC games?

ducalex avatar Sep 15 '25 19:09 ducalex

It's been so long that I can't really explain my allocation decisions, though I know that I have spent hours trying every single allocation being internal or external or bss, so there must have been tradeoffs to explain the current code.

I know exactly what you're going through :) Looking at some of my old SNES emulator code, I really wonder: What the heck was I doing there? :)

I will be trying all the changes you've suggested. Annoyingly the first step will have to be to remove snes9x from retro-core, because IRAM is full there. But if it helps performance then so be it.

Hmm... my suggestions apply to the P4. I don't really know how much benefit there would be for the S3. One would have to check it out. I still have some old code for the S3 and Mario ran at ~45 FPS (with only every other frame being rendered, so ~20 rendered frames per second) - good enough for me to be able to play through the whole game. The code is really messy, but I can upload it for you if you want. I don't think I moved any code to IRAM in the S3 version. And for the P4, retro-core still fits in IRAM, so there might be no need to split Snes9x. But one would have to check that out.

For example rg_surface_create(SNES_WIDTH, SNES_HEIGHT_EXTENDED, RG_PIXEL_565_LE, 0); would always be allocated in external memory.

The last argument of rg_surface_create is an rg_alloc argument, if you put MEM_FAST instead of 0 it will try to allocating in internal RAM.

I've made these changes in the latest commit. Maybe you could take a look and tell me what I need to change to give my code a better chance of getting successfully PRed.

Disable tile buffering (which does no good due to the slow external RAM)

I'm surprised by this. I've actually worked a lot on the tile buffering (I think the og snes9x uses 10x the memory) because disabling it entirely was very very slow on the esp32. I guess that's the difference when you have a faster CPU!

I can check the difference, maybe tomorrow, and then I can report back.

fcipaq avatar Sep 15 '25 20:09 fcipaq

Oh, and it looks like there's an issue with the audio sampling rate. The audio task's queuing limits the main thread to 50 FPS (xQueueSend(task->queue, msg, portMAX_DELAY) == pdTRUE;) The target video FPS rate, however, seems to be 60. Consequently, the video can never catch up with the set rate, which triggers massive frame dropping.

Is it because I mix the entire AUDIO_BUFFER_LENGTH every frame? Which is set to accomodate 50 fps, so the audio code currently always renders a bit more than 1 frame when playing NTSC games?

I think the issue might be: #define AUDIO_BUFFER_LENGTH (AUDIO_SAMPLE_RATE / 50 + 1) (shared.h) vs .frameTime = 1000000 / 60 (rg_system.c)

fcipaq avatar Sep 15 '25 20:09 fcipaq

Oh, one more thing: What would be the best way to determine at compile time, what app is being compiled? Because in some apps the display driver has to allocate a PSRAM frame buffer where in other apps an internal frame buffer is necessary.

fcipaq avatar Sep 15 '25 20:09 fcipaq

I think the issue might be: #define AUDIO_BUFFER_LENGTH (AUDIO_SAMPLE_RATE / 50 + 1) (shared.h) vs .frameTime = 1000000 / 60 (rg_system.c)

Okay in https://github.com/ducalex/retro-go/commit/a67f5eb9d15c417af5104817615d5a5c71e8ad2d I've now made it so that it plays samplerate/framerate samples, as it should. Let me know if it's still wrong and/or you have a better fix.

I've made these changes in the latest commit. Maybe you could take a look and tell me what I need to change to give my code a better chance of getting successfully PRed.

I don't think you've linked the repo I assume it's https://github.com/fcipaq/retro-go/tree/dev ? I'll take a look!

Oh, one more thing: What would be the best way to determine at compile time, what app is being compiled? Because in some apps the display driver has to allocate a PSRAM frame buffer where in other apps an internal frame buffer is necessary.

At compile time all we have is a string in RG_PROJECT_APP, which the preproc can't compare (I think?). If it helps you we can add RG_PROJECT_APP_<appname> so that you can then #ifdef it.

But can't you just use the surface allocated by the app directly? Surface is just a glorified pointer to an alloc of your choice after all, it can be allocated however you want on the app side.

If the buffer you have in mind is internal to the display driver then I suppose we could either:

  • Add an setting that rg_system_init will pass to rg_display_init that will go to the driver. I have a stash where rg_system_init takes a config struct which would be trivial to add settings like that. Maybe I should merge it.
  • Another option is to add an rg_display function to reallocate if needed (rg_display_set_buffer_mode(mode) or maybe rg_display_set_driver_option(enum like RG_DISPLAY_BUFFER_INTERNAL, value), called right after rg_system_init it should be fine.

ducalex avatar Sep 15 '25 21:09 ducalex

I've looked at your repo and doh I just removed double audio buffering in SNES, I guess I'll have to add it back (or wait for your code)!

Maybe you could take a look and tell me what I need to change to give my code a better chance of getting successfully PRed.

I've only skimmed but so far most of the changes seem fine.

Your changes to SNES break compilation on og esp32, (IRAM overflow). Once you mirror my change of making snes9x an independent app it should resolve the issue. I'll do it on my side and test later on (Edit: I meant that I'll patch my local version, not steal your changes and commit them here).

I've now added a timeout parameter to the task functions. I didn't like your non_blocking implementation because it doesn't really imply that the message will be lost, just that the call won't block. Timeout is a bit clearer (to me) and it matches the existing mutex functions. It also gives more control to the caller, if they're willing to wait 10ms rather than 0 or infinity.

The only thing I'm not sure how I feel about is having to carry esp_lcd. It's a bit annoying to have so much extra code and ensure it keeps working every time a new esp-idf version breaks something.

I haven't looked at what you've patched in it exactly (yet) but is there no way to accomplish the same thing from either lcddrv or dsi.h's code?

ducalex avatar Sep 16 '25 01:09 ducalex

Okay in a67f5eb I've now made it so that it plays samplerate/framerate samples, as it should. Let me know if it's still wrong and/or you have a better fix.

Are you trying to adapt the generated samples depending on how long it took to render the frame? I don't understand what you are doing with the following:

samplesRemaining += samplesPerFrame; int samples = (int)samplesRemaining; samplesRemaining -= samples;

Won't this result in always the same number of samples (since you add samplesPerFrame and then subtract samples which is always the same if you start out with samplesRemaining) - am I missing something?

My suggestion also includes a different approach to the frame dropping mechanism. Now, looking at the current implementation: Let's assume the target FPS is 60 fps. However, the CPU can only get you, say, 54 fps. With the current approach, the code checks if (elapsed > app->frameTime + 1500) and that is the case as a frame takes 18.5 ms to render. Since the threshold is 18.2 ms, the next frame will be dropped and the CPU will be much faster and will more than catch up. Moving on the frame after that, since the timer is reset, the CPU will be slow again. And this will cause every second frame to be dropped. Reducing the frame rate to only 30 fps. And that's probably not what you want (unless you are aiming for tearing free vsyncing). My approach is to have a time "bank account" where you check your balance and once you're more one frame behind, a frame will be dropped. Since the audio is double buffered (the video may get 1 frame = 1 buffer behind) without the sound getting affected.

Or are you trying with the above approach to reduce the generated samples in that case so you don't have to wait up for the audio?

I've made these changes in the latest commit. Maybe you could take a look and tell me what I need to change to give my code a better chance of getting successfully PRed.

I don't think you've linked the repo I assume it's https://github.com/fcipaq/retro-go/tree/dev ? I'll take a look!

Yes, that's the repo. Sorry for the inconvenience.

Oh, one more thing: What would be the best way to determine at compile time, what app is being compiled? Because in some apps the display driver has to allocate a PSRAM frame buffer where in other apps an internal frame buffer is necessary.

At compile time all we have is a string in RG_PROJECT_APP, which the preproc can't compare (I think?). If it helps you we can add RG_PROJECT_APP_<appname> so that you can then #ifdef it.

But can't you just use the surface allocated by the app directly? Surface is just a glorified pointer to an alloc of your choice after all, it can be allocated however you want on the app side.

I wasn't sure, for example, about the launcher, which seems to resize surfaces, and that would probably cause a crash. So, it would be necessary to inform the driver, and that would probably require a major rework. I'm trying to keep a low profile here. :)

If the buffer you have in mind is internal to the display driver then I suppose we could either:

* Add an setting that rg_system_init will pass to rg_display_init that will go to the driver. I have a stash where rg_system_init takes a config struct which would be trivial to add settings like that. Maybe I should merge it.

* Another option is to add an rg_display function to reallocate if needed (rg_display_set_buffer_mode(mode) or maybe `rg_display_set_driver_option(enum like RG_DISPLAY_BUFFER_INTERNAL, value)`, called right after rg_system_init it should be fine.

I was thinking about reallocation, too. My problem was: The driver has to decide at initialization whether to allocate the buffer in RAM or PSRAM. For some apps (like the launcher) where there's plenty of RAM available, it's better to allocate in RAM in order to prevent flicker artifacts from slow PSRAM. Telling the driver to allocate in RAM would require us to modify every single app. It might not always be possible to allocate the frame buffer in RAM by default and then release it in case we don't want it due to heap fragmentation. So, instead of making RAM the default, one would have to modify every app to tell the driver to reallocate in RAM and I thought that's something you would not like to happen. But if you agree to do so, then yes, that would be perfectly fine for me.

fcipaq avatar Sep 16 '25 19:09 fcipaq

I've looked at your repo and doh I just removed double audio buffering in SNES, I guess I'll have to add it back (or wait for your code)!

Maybe you could take a look and tell me what I need to change to give my code a better chance of getting successfully PRed.

I've only skimmed but so far most of the changes seem fine.

Your changes to SNES break compilation on og esp32, (IRAM overflow). Once you mirror my change of making snes9x an independent app it should resolve the issue. I'll do it on my side and test later on (Edit: I meant that I'll patch my local version, not steal your changes and commit them here).

Oh, that's too bad, sorry for the inconvenience. But I guess the ESP32 will benefit from that, too. BTW: How fast is Snes9x on the ESP32? On the ESP32-S3 I get 45 fps (that is 22 rendered frames per second) for Super Mario World's intro. Oh, and forget everything I said about the tile cache. You were right, it won't speed up things but might slow the emulator down. I now remember that I did this to make it fit into the Teensy 4.1's RAM (as I had not PSRAM attached). So it was not about speed but size. Sorry for the discombobulation :)

I've now added a timeout parameter to the task functions. I didn't like your non_blocking implementation because it doesn't really imply that the message will be lost, just that the call won't block. Timeout is a bit clearer (to me) and it matches the existing mutex functions. It also gives more control to the caller, if they're willing to wait 10ms rather than 0 or infinity.M

Cool.

The only thing I'm not sure how I feel about is having to carry esp_lcd. It's a bit annoying to have so much extra code and ensure it keeps working every time a new esp-idf version breaks something.

And can see why and you are absolutely right about that. Maybe I can find some way to extract the "essentials". But that's probably going to take me a while.

I haven't looked at what you've patched in it exactly (yet) but is there no way to accomplish the same thing from either lcddrv or dsi.h's code?

I think that would probably be a difficult task. I'll definitely check it out.

fcipaq avatar Sep 16 '25 19:09 fcipaq

Are you trying to adapt the generated samples depending on how long it took to render the frame? I don't understand what you are doing with the following:

samplesRemaining += samplesPerFrame; int samples = (int)samplesRemaining; samplesRemaining -= samples;

Won't this result in always the same number of samples (since you add samplesPerFrame and then subtract samples which is always the same if you start out with samplesRemaining) - am I missing something?

I think you missed that samplesPerFrame and samplesRemaining are floats. Reason is that 32040 / 50 = 640.8. So we have to either round to 640/641, or we carry the .8 for a more accurate speed overall. But anyway those changes were reverted in 845c6ea0f59c54b2bca80afac9b41a71bd052a1e. A better option might be to use 32000hz for PAL and 32040 for NTSC?

My suggestion also includes a different approach to the frame dropping mechanism.

Frame skipping in retro-go is very primitive as you've seen. At some point I kept a rolling average of frame times to adjust the auto skip which worked better but I never completed it. I don't think I've attempted anything fancier. So I would welcome any enhancement even if it's just for one app!

I wasn't sure, for example, about the launcher, which seems to resize surfaces

The launcher resizes things like images but the surface that is actually sent to the display is treated like a raw continuous uint16[] buffer by both the launcher and rg_gui (iirc).

rg_surface_create has an alloc flag, perhaps we could add a bit to say that you want the buffer in display memory. And then behind the scene that buffer can be provided by the display driver. Surfaces used as framebuffers could specify that bit whereas surfaces used for drawing will remain as is. Although that won't do any good for apps where the format isn't the exact same as the display (possibly most of them), it will have to be transformed anyhow 🤔.

So, instead of making RAM the default, one would have to modify every app to tell the driver to reallocate in RAM and I thought that's something you would not like to happen. But if you agree to do so, then yes, that would be perfectly fine for me.

I've actually merged my config struct yesterday. It's still a work in progress and I don't know how many things I want to include in it, but it already contains mallocAlwaysInternal so why not add another one to dictate display driver allocation if it helps you? I think it's a better solution than reallocating after the fact because as you've said, our tiny heap is very easy to fragment.

https://github.com/ducalex/retro-go/blob/dev/components/retro-go/rg_system.h#L136-L149

ducalex avatar Sep 16 '25 20:09 ducalex

BTW: How fast is Snes9x on the ESP32? On the ESP32-S3 I get 45 fps (that is 22 rendered frames per second) for Super Mario World's intro.

As of 327a588a092167c290137c6d0405d970e1e13617 I get around 50fps (8 being rendered) in Super Mario World (NTSC).

As a reminder the numbers are: `FPS:ticks (skipped+full+partial)` where ticks-skipped = rendered frames.

[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:0%, FPS:0 (0+0+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:98%, FPS:55 (42+13+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:52%, FPS:61 (46+15+0), BATT:4142
[info] system_monitor_task: Reduced frameskip to 2
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:93%, FPS:56 (37+19+0), BATT:4142
[info] system_monitor_task: Raised frameskip to 3
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:84%, FPS:55 (40+14+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:98%, FPS:57 (43+11+2), BATT:4142
[info] system_monitor_task: Raised frameskip to 4
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:85%, FPS:54 (43+7+5), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:40 (32+6+2), BATT:4142
[info] system_monitor_task: Raised frameskip to 5
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:47 (39+8+0), BATT:4142  <---- Intro starts playing here
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:50 (42+4+4), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:49 (41+5+3), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:50 (42+6+2), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:49 (41+3+5), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:49 (41+4+4), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:97%, FPS:53 (44+9+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:96%, FPS:49 (40+8+0), BATT:4142


Running at 320Mhz:

[warn] rg_system_set_overclock: Overclock level 4 applied: 319Mhz
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:0%, FPS:0 (0+0+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:91%, FPS:62 (47+15+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:54%, FPS:62 (47+16+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:42%, FPS:60 (45+15+0), BATT:4142
[info] system_monitor_task: Reduced frameskip to 2
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:79%, FPS:61 (41+20+0), BATT:4142
[info] system_monitor_task: Reduced frameskip to 1
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:78%, FPS:55 (27+27+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:60 (31+31+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:97%, FPS:60 (30+12+17), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:97%, FPS:47 (23+23+1), BATT:4142
[info] system_monitor_task: Raised frameskip to 2
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:36 (24+9+4), BATT:4142
[info] system_monitor_task: Raised frameskip to 3
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:54 (40+13+0), BATT:4142
[info] system_monitor_task: Raised frameskip to 4
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:98%, FPS:59 (47+12+0), BATT:4142   <---- Intro starts playing here
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:100%, FPS:57 (45+5+5), BATT:4130
[info] system_monitor_task: Raised frameskip to 5
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:95%, FPS:62 (53+5+5), BATT:4130
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:92%, FPS:58 (47+8+1), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:95%, FPS:63 (54+3+8), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:93%, FPS:58 (47+5+4), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:85%, FPS:63 (53+11+0), BATT:4142
[debug] STACK:5076, HEAP:51+774 (27+768), BUSY:86%, FPS:59 (48+10+0), BATT:4142

ducalex avatar Sep 16 '25 22:09 ducalex

I think you missed that samplesPerFrame and samplesRemaining are floats. Reason is that 32040 / 50 = 640.8. So we have to either round to 640/641, or we carry the .8 for a more accurate speed overall. But anyway those changes were reverted in 845c6ea. A better option might be to use 32000hz for PAL and 32040 for NTSC?

Ah, thanks! Regarding the frequency, I'm not really sure if that matters at all, I guess these small jitters won't even be noticeable.

Frame skipping in retro-go is very primitive as you've seen. At some point I kept a rolling average of frame times to adjust the auto skip which worked better but I never completed it. I don't think I've attempted anything fancier. So I would welcome any enhancement even if it's just for one app!

Ok, then let's go with my approach for now, though it's far from fancy, either :)

The launcher resizes things like images but the surface that is actually sent to the display is treated like a raw continuous uint16[] buffer by both the launcher and rg_gui (iirc).

rg_surface_create has an alloc flag, perhaps we could add a bit to say that you want the buffer in display memory. And then behind the scene that buffer can be provided by the display driver. Surfaces used as framebuffers could specify that bit whereas surfaces used for drawing will remain as is. Although that won't do any good for apps where the format isn't the exact same as the display (possibly most of them), it will have to be transformed anyhow 🤔.

That sounds like a good idea. The display driver can handle buffers that are not the same as the screen resolution with (black bezels). So in the end there is always one single surface that gets send to the display? Because I have experienced issues with only portions of the screen being sent (but this might have been me not yet fully understanding all paths in rg_display.c). I think this might be because windows are being used for the SPI displays, sending one portion of the the screen at a given time?

Speaking of sound, I'm having issues with the sound. There are some faint crackling noises. You might not hear it right away but they're there and once you hear them, you will notice them all the time. I checked with my emulator and they're not there. So, I still have to figure out what the cause is.

I've actually merged my config struct yesterday. It's still a work in progress and I don't know how many things I want to include in it, but it already contains mallocAlwaysInternal so why not add another one to dictate display driver allocation if it helps you? I think it's a better solution than reallocating after the fact because as you've said, our tiny heap is very easy to fragment.

https://github.com/ducalex/retro-go/blob/dev/components/retro-go/rg_system.h#L136-L149

Ok, so I guess the next steps are to pull and rebase your fork (is that the proper terminology?) and then I'll try to make my commits on top of that (minus the ESP_LCD stuff for now). Unfortunately, I'll get to it this weekend at the earliest.

fcipaq avatar Sep 17 '25 18:09 fcipaq

BTW: How fast is Snes9x on the ESP32? On the ESP32-S3 I get 45 fps (that is 22 rendered frames per second) for Super Mario World's intro.

As of 327a588 I get around 50fps (8 being rendered) in Super Mario World (NTSC).

Ok, then 8 fps is what we're starting from. BTW: If you would like to avoid rg_alloc we could also allocate the memory in arrays. Edit: Since we intend to split Snes9x from the rest anyways...

fcipaq avatar Sep 17 '25 18:09 fcipaq

Hi, I tried a couple of things to improve the glue between Retro-Go and the display driver. However,I didn't really come up with a proper solution. rg_surface_submit is no problem. What is problematic is the way lcd_buffer_send (in conjunction with rg_display_write_rect etc.) works. That's why I can't just send a surface directly to the screen. It makes a back buffer that replaces the external display's GRAM mandatory. And in some scenarios there just isn't enough RAM for that.

So, what I tried is: a. to have a back buffer in SPIRAM (was too slow, causes flickering in some situations), b. triple back buffering in SPIRAM - causes data inconsistencies between core 0 and the display call back function running on core 1, which I was unable to resolve c. to have a back buffer in internal RAM on some apps - it works but has visible glitches due to lack of double buffering or is super slow if you wait for a vsync on every lcd_send_buffer.

The question is, how we're gonna deal with this. If you don't want to "just somewhat make it work", one possible solution might be: a. use the P4's graphics accelerator ("PPA") to rescale and rotate the frame buffer. If I can make this work (never used it), there would be no need to run the display on core 1 and that would help with the data inconsistencies (the display driver causes a heavy IRQ load on core 1 with an IRQ triggered for every line - that's 12,000 per second @ 50 fps!) Also, this would make the "hacked" esp_lcd driver superfluous and would be a more generic approach that basically supports all kinds of MIPI display (which I'm sure we're gonna see a lot in upcoming P4 products due to facilitated PCB routing) b. we would need to find a way in rg_display.c to mitigate direct writes to the external frame buffer so that only complete surfaces are transmitted. Writes using lcd_buffer_send are difficult to handle when using triple buffering. Perhaps something like, say, rg_blit_buffer, and - depending on the display type attached - this goes directly to the screen or otherwise goes to a buffer.

Ultimately, it all comes down to whether you want Retro-Go to support displays without external GRAM or if you prefer to keep it centered on SPI. I mean there are good reasons for both options.

That being said, Snes9x with Retro-Go runs great on the P4, I uploaded a short video here. (I hope I pronounced your name correctly, sorry if I did not).

I suggest that I start out with the commits that should speed up Snes9x on the older ESP32s. And I would commit the P4 only after it is in a better working state.

Please let me know me what you think about it.

fcipaq avatar Sep 21 '25 18:09 fcipaq

Because I have experienced issues with only portions of the screen being sent (but this might have been me not yet fully understanding all paths in rg_display.c). I think this might be because windows are being used for the SPI displays, sending one portion of the the screen at a given time?

Yeah retro-go sets a window and then sends color data to fill it. But it's not thread safe, so depending on how you're emulating the lcd_* behavior, it's easy for something else to update the window whilst the display task is still sending pixel data. I've had that issue when working on the SDL2 port.

Another thing that causes partial updates are, well, partial updates. The display task does a diff between the new frame and the previous one and sends only differing blocks. Partial updates are very helpful on the SPI side, they reduce tearing and increase the effective ramerate (otherwise we can barely reach 30fps at 40Mhz. 80Mhz is too glitchy and the APB doesn't allow us to use 60Mhz...). But it's definitely a waste of resources if your display is fast enough. If you #define RG_SCREEN_PARTIAL_UPDATES 0 it should neuter it.

I'll catch up with your branch in the upcoming days and see how your display code is currently working. I also saw that you seem to have fixed the audio issue. Was the main issue just the size of the dma buffers? If so then I could probably find a nicer way than having a target-specific value.

It's on my todo list to drop the i2s legacy driver anyway, it's been deprecated since 5.0 I think. They're not fixing bugs in it and they ought to remove it soon...

BTW: If you would like to avoid rg_alloc we could also allocate the memory in arrays. Edit: Since we intend to split Snes9x from the rest anyways...

Agreed, now that snes9x is standalone again we can reclaim the entire BSS and remove a few allocs if it helps performance, as well as put IRAM_ATTR everywhere meaningful.

If you want you can open a PR with just your snes9x improvements so we could get those merged whilst the details of the new P4 display are still being worked on.

ducalex avatar Sep 23 '25 21:09 ducalex

That being said, Snes9x with Retro-Go runs great on the P4, I uploaded a short video here. (I hope I pronounced your name correctly, sorry if I did not).

I really like your device. It's very stylish and polished and snes9x seems to run particularly well on it!

The glitches could very well be what I was suggested, that something changes the window mid frame buffer update.

In Retro-Go there are two things that draw directly to the display:

  • The frame sent via rg_display_submit
  • The gui (the dialog system, the hourglass icon, the OSD)

There is some half-assed attempt at synchronization to make sure display_task is idle before the gui sends data to the screen, but it's possible that your code bypasses it or my sync just doesn't work well in the first place.

Once I look at your code more attentively I'm sure I'll be able to know what's going on, if you don't find out before me!

As for my name I've never said it out loud myself or heard others say it before, so I guess there's no specific way of pronouncing it ;)

ducalex avatar Sep 23 '25 21:09 ducalex

Yeah retro-go sets a window and then sends color data to fill it. But it's not thread safe, so depending on how you're emulating the lcd_* behavior, it's easy for something else to update the window whilst the display task is still sending pixel data. I've had that issue when working on the SDL2 port.

Ok, that's what I thought. And that makes perfectly sense when using a display with GRAM.

Another thing that causes partial updates are, well, partial updates. The display task does a diff between the new frame and the previous one and sends only differing blocks. Partial updates are very helpful on the SPI side, they reduce tearing and increase the effective ramerate (otherwise we can barely reach 30fps at 40Mhz. 80Mhz is too glitchy and the APB doesn't allow us to use 60Mhz...). But it's definitely a waste of resources if your display is fast enough. If you #define RG_SCREEN_PARTIAL_UPDATES 0 it should neuter it.

Ok, I'll give it a try.

I'll catch up with your branch in the upcoming days and see how your display code is currently working. I also saw that you seem to have fixed the audio issue. Was the main issue just the size of the dma buffers? If so then I could probably find a nicer way than having a target-specific value.

That is what I suspected, but it turns out the DMA buffers are not the issue (but increasing them might still help with taking some load off of the CPU). My working theory for the crackling is as follows: Since the sound generation is bound to the signal (task message or semaphore) from the main thread, there is no way for the sound to get ahead of the picture. If, however the image generation falls behind, it can catch up by dropping the next frame. But the sound cannot. So the buffer will run empty and you will be able to hear a crackling. That must have been the reason why I used multiple buffers back then, effectively letting the image run ahead a couple of ms of the sound so that in case the image lags for a single frame, the sound buffers won't run empty. That being said, I just decoupled the sound generation from the main thread and to my surprise it works (I would have guessed you cannot generate sound at any point but apparently at least in some games it possible). Still, it would be better have the sound triple buffered and let the image run one frame (i.e. 17-20ms) ahead of the sound.

If you want you can open a PR with just your snes9x improvements so we could get those merged whilst the details of the new P4 display are still being worked on.

I could try to PR the changes this weekend (this is a busy week; I'm going at full speed :) ). However, if you prefer catching up yourself, please feel free to do so. It would probably be easier this way since I have no way to test it on an ESP32/ESP32-S3.

fcipaq avatar Sep 24 '25 13:09 fcipaq

I really like your device. It's very stylish and polished and snes9x seems to run particularly well on it!

Thanks, glad you like it :) I hope it'll be available for DIY building soon...

The glitches could very well be what I was suggested, that something changes the window mid frame buffer update.

In Retro-Go there are two things that draw directly to the display:

* The frame sent via rg_display_submit

* The gui (the dialog system, the hourglass icon, the OSD)

There is some half-assed attempt at synchronization to make sure display_task is idle before the gui sends data to the screen, but it's possible that your code bypasses it or my sync just doesn't work well in the first place.

Once I look at your code more attentively I'm sure I'll be able to know what's going on, if you don't find out before me!

I guess they say: "Two heads are better than one." :)

As for my name I've never said it out loud myself or heard others say it before, so I guess there's no specific way of pronouncing it ;)

Phew, that's a relief :)

fcipaq avatar Sep 24 '25 13:09 fcipaq

A little off topic, but a couple of years ago, I used a Teensy 4.1 attached to an SPI display and even though I could push it up to 65 MHz it was still too slow for 60fps. What I did is, I reduced the bitdepth from 16 bits to 12 bits (i.e. 65536 to 4096 colors). Effectively reducing the amount of data that needs to be pushed by 25%. The ILI9341 controller doesn't support it but, ST7789 does. The command is COLMOD (3Ah). Then you would transform the line before transmitting it. You could do that in lcd_send_buffer()

` for (int i = 0; i < TFT_WIDTH; i += 2) { // 1st pixel uint16_t col = tft_buffer[i * TFT_HEIGHT + (TFT_HEIGHT - scanline - 1)]; uint8_t cred1 = ((col >> 11) & 0x1f) / 2; uint8_t cgreen1 = ((col >> 5) & 0x3f) / 4; uint8_t cblue1 = ((col & 0x1f)) / 2;

// 2nd pixel
col = tft_buffer[(i + 1) * TFT_HEIGHT + (TFT_HEIGHT - scanline - 1)];
uint8_t cred2 = ((col >> 11) & 0x1f) / 2;
uint8_t cgreen2 = ((col >> 5) & 0x3f) / 4;
uint8_t cblue2 = ((col & 0x1f)) / 2;

// 1st and 2nd pixels packed
internal_line_buffer_8_dst[cnt] = cred1 * 16 + cgreen1;
cnt++;      
internal_line_buffer_8_dst[cnt] = cblue1 * 16 + cred2;
cnt++;      
internal_line_buffer_8_dst[cnt] = cgreen2 * 16 + cblue2;
cnt++;      

} `

One could argue that this decreases color accuracy a bit, but we're talking about consoles that barely exceed a color palette of 256. It might be worth a try.

fcipaq avatar Sep 24 '25 13:09 fcipaq

Hi! I made some progress over the weekend. First of all, I had to discard my previous idea of using the PPA. There only is little documentation available online, but as it seems, the PPA won't output a stream of pixels that you can forward to the display FIFO but rather processes blocks of pixels, and so the PPA needs an additional target frame buffer, and that won't do us no good. At least that's how I understood it. Now, I was able to contain the display driver and only keep the "hacked" part. The rest of the driver (like bus setup, and so on) is just the default esp_lcd driver from the IDF. It's now all contained in retro-go/drivers/display/dsi. Since I was unable to reliably make the rg_display system work, I've switched to one of your earlier suggestions:

From what I'm understanding, for your device to work well you just need a way to ditch the display_task mechanism entirely and have rg_display_submit let your driver do whatever with the raw frame buffer?

However, this also requires all apps to be slightly modified. Because then one has to pass a frame buffer to the DSI driver (all previous attempts with the driver auto-detecting these were subpar and might lead to unexpected behavior in certain situations, also it would be counterintuitive). The launcher and Snes9x are already working pretty well. But, as you've mentioned before: I still need to implement the palette lookup, amongst some other things.

Maybe you can take a look at the code and tell me what I would need to change for a PR. That would help me a lot, as I wouldn't have to manually keep up with your repository. Also, maybe it's time to come up with names for the RG_SCREEN_DRIVER values (the numbers might get confusing at times...)

fcipaq avatar Sep 29 '25 08:09 fcipaq