neorv32
feature request - support for Zc* extensions
the proposed `Zc*` extensions described here have recently been ratified.... the `Zca` extension is of particular interest in "small cores" with limited memory resources.... just a placeholder for what appears to be a non-trivial improvement....
The standard `C` extension is already implemented (-> `CPU_EXTENSION_RISCV_C` generic). In this particular case `C` = `Zcf`, but all the compressed floating-point operations are mapped to normal integer load/store because `Zfinx` is used instead of `F`.
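As an illustration, the effect of `Zfinx` on compressed instruction selection can be modeled like this (a hypothetical sketch, not the actual NEORV32 decoder or compiler logic): since FP values live in the integer register file under `Zfinx`, a compiler never emits `c.flw`/`c.fsw` and uses plain `c.lw`/`c.sw` instead.

```python
# Hypothetical sketch: which compressed instruction a compiler would pick
# for a 32-bit float load, depending on the FP configuration.
# (Illustrative only -- not the actual NEORV32 decoder or GCC/LLVM logic.)

def float_load_insn(has_f: bool, has_zfinx: bool) -> str:
    """Return the compressed instruction used to load a 32-bit float."""
    if has_zfinx:
        # Zfinx: FP operands live in the integer register file,
        # so an ordinary integer load is used -- no c.flw needed.
        return "c.lw"
    if has_f:
        # Classic F extension: dedicated FP register file, so the
        # compressed FP load (part of Zcf) is applicable.
        return "c.flw"
    # No FP at all: soft-float, still an integer load.
    return "c.lw"

print(float_load_insn(has_f=True, has_zfinx=False))   # c.flw
print(float_load_insn(has_f=False, has_zfinx=True))   # c.lw
```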
I had a closer look at the `Zc*` ISA extensions. I think they are quite promising. However, in terms of the NEORV32, I am not fully convinced that it would be a good idea to implement all of them.
- `Zca` - this is what we have if the FPU is disabled (`Zfinx` disabled)
- `Zcf` - this is what we have if the FPU is enabled
- `Zcd` makes no sense as the FPU is single-precision only
- `Zcmp` (list-based push/pop similar to ARM) is quite interesting, but would require a lot of additional hardware. Furthermore, precise exception trapping is complex here as there are several memory loads/stores invoked by a single instruction.
- `Zcmt` (table-based jumps) might be a nice thing to have. But this would have a high latency - so the only gain would be further code size reduction (not performance).
- `Zcb` - I really like this extension because it adds 16-bit variants for common operations (like multiplication). This would be quite easy to implement, I think. So, yeah, maybe this sub-extension might be integrated in the future. :wink:
This is just my opinion. Any thoughts?
`Zcb` looks promising, in that it is relatively easy to implement.... with disciplined declarations of integer types in EM (`uint8`, `int16`, `uint32`, etc) this would mesh quite well.... future CPU implementations that use (say) an internal 8- or 16-bit ALU would also benefit; reducing ALU width obviously saves gates....
as for `Zcmp`, the EM runtime would generally place some common push/pop code fragments (used by LLVM) into the boot ROM.... even with the smallest boot ROM (say, 2K), there is plenty of "free and fast" instruction memory....
i'm not entirely sure what motivates `Zcmt`.... but honestly, reducing code size without a performance gain doesn't seem worth it anyway....
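For reference, here is a tiny Python model of what a `Zcmt` table-based jump does (hypothetical and simplified; the real encoding and the `jvt` CSR are more involved). It shows where the extra latency comes from: the jump target must first be loaded from the jump vector table in memory, adding a memory access before the jump itself.

```python
# Simplified model of a Zcmt table-based jump (cm.jt).
# Hypothetical sketch -- the real encoding and jvt CSR are more involved.

def cm_jt(memory: dict, jvt_base: int, index: int) -> int:
    """Return the jump target for jump-vector-table entry `index`.

    Note the extra memory load: the target address must be fetched
    from the table before the jump can be taken. This is the source
    of the added latency -- the payoff is that each call site shrinks
    to a single 16-bit instruction.
    """
    entry_addr = jvt_base + index * 4   # 32-bit entries on RV32
    return memory[entry_addr]           # additional load -> extra latency

# Example: a 3-entry jump table at address 0x1000
mem = {0x1000: 0x80000100, 0x1004: 0x80000200, 0x1008: 0x80000300}
print(hex(cm_jt(mem, 0x1000, 2)))  # 0x80000300
```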
bottom line -- `Zcb` would be first, when we get around to it....
> bottom line -- `Zcb` would be first, when we get around to it....
I agree! But before we start implementing that we should wait for GCC support. Unfortunately, upcoming GCC 13(.1) does not include `Zcb` (https://gcc.gnu.org/gcc-13/changes.html).
i'm finding that LLVM is much more current with risc-v extensions.... looks like they have lots of `Zb*` support -- as requested in #640
since EM supports both compilers, comparative benchmarks are trivial....
Interesting results! Thanks for sharing!
35% would be quite amazing, but I'm not sure what the "cost" of that might be (additional hardware resources, impact on critical path, etc.). `Zcmp` adds push and pop operations that would require modifying the CPU's pipeline (as there are several memory accesses triggered by a single instruction).
But the NEORV32's execution stage is a multi-cycle architecture... so maybe the additional hardware overhead would be quite small... I think I'll need to have a closer look at this again.
Moreover, Zcb has become mandatory for the RVA23 profile.
Oh, I did not expect that. However, RVA is the application-class profile (MMU, 64-bit, ...), which is out of scope of this project right now 🙈
I had another look at the `Zcb` specs. Basically, it just adds 11 new compressed instructions. Adding the memory operations should be quite easy and I think the performance benefit might be noticeable. Adding the remaining instructions (bit-manip, multiplication, inversion) is a little bit more complex but still doable.
Anybody volunteering to do a PR? 😅
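For context, here is a sketch of the eleven RV32 `Zcb` instructions and the 32-bit instructions they expand to (my reading of the spec; double-check against the ratified document before relying on it). Each one decompresses 1:1 into a single existing instruction, which is why the implementation effort is modest.

```python
# RV32 Zcb instructions and their 32-bit expansions (1:1 mapping).
# Sketch based on my reading of the Zc* spec -- verify against the
# ratified document before implementing.
ZCB_EXPANSION = {
    "c.lbu":    "lbu rd', uimm(rs1')",
    "c.lhu":    "lhu rd', uimm(rs1')",
    "c.lh":     "lh rd', uimm(rs1')",
    "c.sb":     "sb rs2', uimm(rs1')",
    "c.sh":     "sh rs2', uimm(rs1')",
    "c.zext.b": "andi rd', rd', 0xff",
    "c.sext.b": "sext.b rd', rd'",    # requires Zbb
    "c.zext.h": "zext.h rd', rd'",    # requires Zbb
    "c.sext.h": "sext.h rd', rd'",    # requires Zbb
    "c.not":    "xori rd', rd', -1",
    "c.mul":    "mul rd', rd', rs2'", # requires M (or Zmmul)
}

# Eleven instructions on RV32, each a simple alias for one existing
# 32-bit instruction -- no new datapath operations are needed.
print(len(ZCB_EXPANSION))  # 11
```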
btw, I tested LLVM 18 with/without Zcmp:
The `Zcmp` push/pop instructions are quite powerful as they can "compress" up to 13 loads/stores and an addition into a single 16-bit instruction! They might even increase performance a little bit as there will be less traffic on the CPU's instruction fetch interface.
However, the big problem with these two instructions is that they do not decompress into a single 32-bit counterpart. Instead, they decompress into several different instructions, which would require a lot of hardware overhead. So I think that the "costs" clearly exceed the benefits here.
What do you think? 🤔
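To make that hardware cost concrete, here is a rough Python model of what a `cm.push` has to expand into (hypothetical and simplified; the actual Zcmp register lists, offsets, and stack-alignment rules are more detailed). One 16-bit instruction becomes a variable-length sequence of stores plus a stack-pointer adjustment, which is exactly what a simple 1:1 decompressor cannot produce.

```python
# Rough model of cm.push {ra, s0-sN}, -stack_adj expansion.
# Hypothetical sketch -- real Zcmp register lists, offsets and
# alignment rules are more detailed (see the ratified spec).

def expand_cm_push(num_sregs: int, stack_adj: int) -> list:
    """Expand a cm.push saving ra plus s0..s(num_sregs-1)."""
    regs = ["ra"] + [f"s{i}" for i in range(num_sregs)]
    seq = []
    offset = -4
    for reg in reversed(regs):
        seq.append(f"sw {reg}, {offset}(sp)")  # one store per saved register
        offset -= 4
    seq.append(f"addi sp, sp, -{stack_adj}")   # single stack adjustment
    return seq

# Worst case: ra + s0..s11 -> 13 stores + 1 addi from ONE 16-bit instruction
seq = expand_cm_push(num_sregs=12, stack_adj=64)
print(len(seq))  # 14
```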
The main advantages, from my biased point of view, are reducing the load on the instruction fetch channel, less cache pollution, and a positive impact on interrupt handler latency. However, the most valuable aspect is the reduction of code size to at least the level of the Cortex-M0.
I think technical difficulties are unavoidable, and it's hard to objectively evaluate their value until they come into play :). If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I want to hope that the overhead won't be so huge and won't ruin the whole idea.
> The main advantages from my biased point of view are reducing the load on the instruction fetch channel
That's true! In its best case, this instruction saves up to 27 further 16-bit words from being fetched.
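That figure can be sanity-checked with a quick calculation, assuming the worst-case `cm.pop`: 13 loads plus one stack adjustment, all replaced by a single 16-bit instruction.

```python
# Sanity check of the "27 halfwords saved" figure for a worst-case cm.pop:
# 13 loads + 1 addi, each a 32-bit (2-halfword) instruction, replaced
# by a single 16-bit (1-halfword) instruction.
loads = 13
addi = 1
replaced_halfwords = (loads + addi) * 2  # 28 halfwords of fetch traffic
cm_pop_halfwords = 1
saved = replaced_halfwords - cm_pop_halfwords
print(saved)  # 27
```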
> less cache pollution
Also true. However, embedded single-core systems might not need any kind of caches if you use fast on-chip memory.
> and it should have a positive impact on interrupt handler latency
I'm not sure about this. The execution time itself would be identical. However, due to reduced cache pollution / bus congestion there might be a relevant speedup.
> However, the most valuable aspect is the reduction of byte-code size to at least the level of Cortex-M0.
Maybe, but technically such a complex instruction isn't "RISC" anymore, right? 😅
> If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I want to hope that the overhead won't be so huge and won't ruin the whole idea.
I think there are several benchmark examples provided by the people who invented these extended compressed instructions. The benefit (looking purely at code size and performance) is quite impressive!