Facilitate RP2040 XIP-cache-as-RAM feature
The pico-sdk and RP2040 hardware provide a few facilities that improve performance by moving runtime code and data into SRAM:
- "pico/platform/sections.h" currently provides the "__not_in_flash", "__not_in_flash_func", and "__time_critical_func" macros for placing runtime code and data into SRAM by assigning them linker section names in the source code.
- The pico-sdk CMake scripts allow any of four binary types to be selected with similarly named project properties for the RP2040: "default", "blocked_ram", "copy_to_ram", or "no_flash"
- The RP2040's eXecute-In-Place (XIP) cache has its own connection to the main AHB bus and provides SRAM speeds on cache hits when retrieving runtime code and data from flash
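For context, here's a minimal sketch of how those section macros are applied in source code (the function and variable names are just placeholders):

```c
#include "pico/platform/sections.h"
#include <stddef.h>
#include <stdint.h>

// Data kept out of flash: it lives in SRAM for the lifetime of the program.
static uint8_t __not_in_flash("lut") lookup_table[256];

// A function body placed in SRAM so it never triggers an XIP flash fetch.
static uint32_t __not_in_flash_func(sum_bytes)(const uint8_t *p, size_t n) {
    uint32_t sum = 0;
    while (n--) sum += *p++;
    return sum;
}

// __time_critical_func marks the function with the time-critical section name,
// which these changes can route into XIP RAM for copy_to_ram / no_flash builds.
static void __time_critical_func(fast_handler)(void) {
    // latency-sensitive work here
}
```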
But this regime isn't perfect. The 16kB of XIP cache and its connection to the main AHB bus go mostly unused for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds, leaving some performance opportunities unrealized in their implementations.
The RP2040's XIP cache can be disabled by clearing its CTRL.EN bit, which allows its 16kB of memory to be used directly as SRAM.
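As a standalone illustration (not code from this PR), doing that at the register level looks roughly like this; the register and address symbols are the RP2040 SDK's:

```c
#include "hardware/structs/xip_ctrl.h"   // xip_ctrl_hw and the XIP_CTRL_* bit masks
#include "hardware/address_mapped.h"     // hw_clear_bits()
#include "hardware/regs/addressmap.h"    // XIP_SRAM_BASE
#include <stdint.h>

static void xip_cache_as_sram_demo(void) {
    // Clear CTRL.EN: cache lookups stop and the 16kB of cache memory becomes
    // directly addressable SRAM starting at XIP_SRAM_BASE (0x15000000).
    hw_clear_bits(&xip_ctrl_hw->ctrl, XIP_CTRL_EN_BITS);

    volatile uint32_t *xip_ram = (volatile uint32_t *) XIP_SRAM_BASE;
    xip_ram[0] = 0x12345678u;   // behaves like ordinary SRAM once the cache is off
}
```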
These changes aim to update the pico-sdk to support the following:
- Use the "__time_critical_func" macro to place runtime code into XIP RAM for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds
- Add a couple new "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram" binary type builds for the RP2040
- Add a new "PICO_USE_XIP_CACHE_AS_RAM" CMake property to enable the XIP cache's use as RAM for time critical instructions in PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds
- Add a couple of new CMake functions, "pico_sections_not_in_flash(TARGET [list_of_sources])" and "pico_sections_time_critical(TARGET [list_of_sources])", that mark either selected source files or a whole CMake target's source list for placement into RAM and/or XIP RAM (usage sketch below)
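Putting the pieces together, a hypothetical CMakeLists.txt fragment could look roughly like this (the target and file names are made up, and exactly how the new property gets set may differ from the final implementation):

```cmake
add_executable(my_app src/main.c src/fast_path.c src/background.c)
target_link_libraries(my_app pico_stdlib)

# Existing SDK function: run the whole image out of SRAM instead of flash
pico_set_binary_type(my_app copy_to_ram)

# Proposed property: free up the 16kB XIP cache and use it as RAM
set_target_properties(my_app PROPERTIES PICO_USE_XIP_CACHE_AS_RAM 1)

# Proposed function: place every function in the listed source files into XIP RAM
pico_sections_time_critical(my_app src/fast_path.c)
```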
I believe I've achieved these 4 goals, but note that I've only tested them manually with CMake-based builds on the RP2040 hardware that I have. I have made an effort to fail fast when configuration properties are incompatible, and to stay compatible with the preexisting section names and linker scripts.
Fixes #2653
This isn't my area of expertise, but just out of curiosity...
> But this regime isn't perfect. The 16kB of XIP cache and its connection to the main AHB bus go mostly unused for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds, leaving some performance opportunities unrealized in their implementations.
Does this actually increase performance, or does it just increase the amount of usable memory by moving some stuff out of main SRAM into the XIP-SRAM?
Add a couple new "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram" binary type builds for the RP2040
Presumably this only works for binaries less than 16kB? And for binaries that small, wouldn't they mostly end up persisting in the XIP cache anyway when run as a regular flash binary? :thinking:
> Does this actually increase performance, or does it just increase the amount of usable memory by moving some stuff out of main SRAM into the XIP-SRAM?
For my application I'm running a PICO_COPY_TO_RAM build with dual cores and multiple DMA channels active in the background. I've got a lot of work to get done on the RP2040 in a limited CPU cycle budget.
For testing I've got a custom systick based logging loop that I can run on both cores (C0 and C1) simultaneously. Here's what some of my performance numbers look like in the following 4 scenarios:
- Instructions in main RAM, shared with the logger's data output, and no DMA activity in main RAM
  - One log write completes in ~20 CPU cycles (±2)
```
print_tick_logs()
C1[ 1] 16777195 20: 1 after
C0[ 1] 16777193 22: 1 after
C1[ 2] 16777175 20: 2 after
C0[ 2] 16777173 20: 2 after
C1[ 3] 16777155 20: 3 after
C0[ 3] 16777153 20: 3 after
C1[ 4] 16777134 21: 4 after
C0[ 4] 16777133 20: 4 after
C1[ 5] 16777114 20: 5 after
C0[ 5] 16777113 20: 5 after
C1[ 6] 16777094 20: 6 after
C0[ 6] 16777093 20: 6 after
C1[ 7] 16777074 20: 7 after
C0[ 7] 16777072 21: 7 after
C0[ 8] 16777052 20: 8 after
C1[ 8] 16777052 22: 8 after
C0[ 9] 16777030 22: 9 after
C1[ 9] 16777030 22: 9 after
C0[ 10] 16777012 18: 10 after
C1[ 10] 16777010 20: 10 after
```
- Instructions in XIP RAM, with the logger's data output in main RAM and no DMA activity in main RAM
  - One log write completes in ~22 CPU cycles (±2)
```
print_tick_logs()
C1[ 1] 16777192 23: 1 after
C0[ 1] 16777191 23: 1 after
C1[ 2] 16777170 22: 2 after
C0[ 2] 16777169 22: 2 after
C1[ 3] 16777148 22: 3 after
C0[ 3] 16777147 22: 3 after
C1[ 4] 16777126 22: 4 after
C0[ 4] 16777125 22: 4 after
C1[ 5] 16777104 22: 5 after
C0[ 5] 16777103 22: 5 after
C1[ 6] 16777082 22: 6 after
C0[ 6] 16777081 22: 6 after
C1[ 7] 16777060 22: 7 after
C0[ 7] 16777059 22: 7 after
C1[ 8] 16777038 22: 8 after
C0[ 8] 16777037 22: 8 after
C0[ 9] 16777015 22: 9 after
C1[ 9] 16777014 24: 9 after
C1[ 10] 16776995 19: 10 after
C0[ 10] 16776994 21: 10 after
```
- Instructions in main RAM, shared with the logger's data output, and concurrent DMA activity in main RAM
  - One log write completes in ~26 CPU cycles (±7)
```
print_tick_logs()
C0[ 1] 16777190 24: 1 after
C1[ 1] 16777182 33: 1 after
C0[ 2] 16777165 25: 2 after
C1[ 2] 16777152 30: 2 after
C0[ 3] 16777136 29: 3 after
C1[ 3] 16777125 27: 3 after
C0[ 4] 16777111 25: 4 after
C1[ 4] 16777099 26: 4 after
C0[ 5] 16777087 24: 5 after
C1[ 5] 16777069 30: 5 after
C0[ 6] 16777062 25: 6 after
C1[ 6] 16777043 26: 6 after
C0[ 7] 16777036 26: 7 after
C1[ 7] 16777022 21: 7 after
C0[ 8] 16777011 25: 8 after
C1[ 8] 16776987 35: 8 after
C0[ 9] 16776982 29: 9 after
C0[ 10] 16776961 21: 10 after
C1[ 9] 16776953 34: 9 after
C1[ 10] 16776926 27: 10 after
```
- Instructions in XIP RAM, with the logger's data output in main RAM and concurrent DMA activity in main RAM
  - One log write completes in ~24 CPU cycles (±3)
```
print_tick_logs()
C0[ 1] 16777190 25: 1 after
C1[ 1] 16777188 27: 1 after
C0[ 2] 16777165 25: 2 after
C1[ 2] 16777162 26: 2 after
C0[ 3] 16777142 23: 3 after
C1[ 3] 16777139 23: 3 after
C0[ 4] 16777117 25: 4 after
C1[ 4] 16777114 25: 4 after
C0[ 5] 16777093 24: 5 after
C1[ 5] 16777091 23: 5 after
C0[ 6] 16777070 23: 6 after
C1[ 6] 16777069 22: 6 after
C0[ 7] 16777045 25: 7 after
C1[ 7] 16777043 26: 7 after
C0[ 8] 16777022 23: 8 after
C1[ 8] 16777019 24: 8 after
C0[ 9] 16776995 27: 9 after
C1[ 9] 16776994 25: 9 after
C0[ 10] 16776972 23: 10 after
C1[ 10] 16776969 25: 10 after
```
Each CPU can fetch 2 instructions in 1 cycle (each 32-bit fetch returns two 16-bit Thumb instructions), so theoretically the two CPUs pulling instructions over the XIP I/O bus will see some contention at first, but then naturally start interleaving their instruction fetches and can sustain 1 instruction per CPU cycle. This contention likely accounts for the extra ~2 CPU cycles seen while using XIP RAM in scenario 2 vs main RAM in scenario 1.
In scenarios 3 and 4, where there's heavy contention with the single DMA engine copying data around in main RAM, we see benefits from running instructions out of XIP RAM. It is faster, yes, but perhaps more importantly the timing ranges are tighter.
So to answer your question: running instructions for both CPUs out of XIP RAM does imply a small performance hit vs main RAM when there's little contention in main RAM, but when there is contention it can perform better in both speed and timing predictability.
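For reference, here's a rough sketch of the kind of SysTick-based cycle counting behind the numbers above (this is not the actual logging harness, just the underlying measurement idea, using "hardware/structs/systick.h"):

```c
#include "hardware/structs/systick.h"
#include <stdint.h>

static void cycle_counter_init(void) {
    systick_hw->rvr = 0x00FFFFFF;   // maximum 24-bit reload value
    systick_hw->cvr = 0;            // any write clears the current value
    systick_hw->csr = 0x5;          // ENABLE | CLKSOURCE (count processor clock cycles)
}

// SysTick counts down, so elapsed cycles = start - end, modulo 2^24.
static uint32_t cycles_between(uint32_t start, uint32_t end) {
    return (start - end) & 0x00FFFFFFu;
}

// usage:
//   uint32_t t0 = systick_hw->cvr;
//   do_one_log_write();
//   uint32_t cycles = cycles_between(t0, systick_hw->cvr);
```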
Add a couple new "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram" binary type builds for the RP2040
Presumably this only works for binaries less than 16kB? And for binaries that small, wouldn't they mostly end up persisting in the XIP cache anyway when run as a regular flash binary? 🤔
You can use the new "pico_sections_time_critical()" CMake function to put your whole project's instructions into the 16kB of XIP RAM if they fit, but the changes I've provided are really geared towards moving a targeted subset of functions into XIP RAM. That is done either by applying the "__time_critical_func" macro in code directly, or by calling the "pico_sections_time_critical()" CMake function in build scripts with a list of source files whose functions should all be placed into XIP RAM.
For COPY_TO_RAM and NO_FLASH builds, binaries are not resident in the XIP cache.
Note: the linker complains and fails the build if your binaries exceed the 16kB capacity of the XIP RAM.
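To make the two usage patterns above concrete (target and file names are hypothetical):

```cmake
# Whole target: functions from every source file go into XIP RAM (must fit in 16kB)
pico_sections_time_critical(my_app)

# Targeted subset: only the listed files' functions go into XIP RAM
pico_sections_time_critical(my_app src/isr.c src/fast_loop.c)
```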
Thanks for answering my (probably naive) questions.
> Each CPU can fetch 2 instructions in 1 cycle (each 32-bit fetch returns two 16-bit Thumb instructions), so theoretically the two CPUs pulling instructions over the XIP I/O bus will see some contention at first, but then naturally start interleaving their instruction fetches and can sustain 1 instruction per CPU cycle. This contention likely accounts for the extra ~2 CPU cycles seen while using XIP RAM in scenario 2 vs main RAM in scenario 1.
If your code is very timing-critical, I wonder if you might be able to get even better performance by allocating the non-striped SRAM banks to specific CPUs and/or DMA channels? See section 2.6.2 in https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf
> If your code is very timing-critical, I wonder if you might be able to get even better performance by allocating the non-striped SRAM banks to specific CPUs and/or DMA channels? See section 2.6.2 in https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf
I looked at this. The striped main RAM does a reasonably good job of distributing accesses from multiple CPUs and DMA engines across the smaller number of main RAM bus connections. Assigning dedicated SRAM banks is only going to be a big performance win when access to an individual SRAM bank is serialized, i.e. used by just one CPU or DMA engine at a time.
For example, the core0 processor's stack is placed in the scratch_y SRAM bank and the core1 processor's stack is placed in the scratch_x SRAM bank. Since stack access is limited to a single CPU thread, this works out nicely. Note: those scratch banks can also be used to store thread-local data for their respective CPUs without fear of access contention.
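As a small illustration (using the existing __scratch_x/__scratch_y macros from "pico/platform/sections.h"; the variable names are made up), per-core state can live next to each core's stack in the non-striped banks:

```c
#include "pico/platform/sections.h"
#include <stdint.h>

// Core 1's private state shares the scratch_x bank with core 1's stack...
static uint32_t __scratch_x("core1_state") core1_counter;

// ...and core 0's private state shares the scratch_y bank with core 0's stack,
// so neither core's accesses here contend with the other core or with DMA
// traffic in the striped main RAM.
static uint32_t __scratch_y("core0_state") core0_counter;
```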
My main consideration for the changes here is that there can be contention between CPU instruction fetches and CPU data fetches/writes. In the RP2040's "default" and "blocked_ram" binary type builds, CPU instruction fetches go to the XIP cache, so that instruction+data contention doesn't happen. In the "copy_to_ram" and "no_flash" binary type builds, however, CPU instruction fetches share the four main RAM bus connections with CPU data fetches/writes, so instruction+data contention can occur. This PR introduces a build-time toggle for the "copy_to_ram" and "no_flash" binary type builds that routes their CPU instruction fetches to the XIP RAM, so that contention can be avoided in these binary type builds too.
The RP2040 has limited bus connections, and it looks like the RP2350 would improve this situation dramatically for my use case, but my aim with this PR is not so grand. I did this work to figure these things out for my own exploration, and I want to share a change that I think makes it easy for other developers to unlock the XIP bus connection for use within the RP2040's "copy_to_ram" and "no_flash" binary type builds.
Converted to draft as likely subsumed by #2660
@will-v-pi had pointed out some CMake issues; I believe I've fixed those. I also tried to clean up some of my comments a bit. See my updated changes in this PR.