neorv32
feature request - support for Zc* extensions
the proposed `Zc*` extensions described here have recently been ratified.... the `Zca` extension is of particular interest in "small cores" with limited memory resources.... just a placeholder for what appears to be a non-trivial improvement....
The standard `C` extension is already implemented (-> `CPU_EXTENSION_RISCV_C` generic). In this particular case `C` = `Zcf`, but all the compressed floating-point operations are mapped to normal integer load/store because `Zfinx` is used instead of `F`.
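As an illustration, the effect of `Zfinx` on compressed instruction selection can be modeled like this (a hypothetical sketch, not the actual NEORV32 decoder or compiler logic): since FP values live in the integer register file under `Zfinx`, a compiler never emits `c.flw`/`c.fsw` and uses plain `c.lw`/`c.sw` instead.

```python
# Hypothetical sketch: which compressed instruction a compiler would pick
# for a 32-bit float load, depending on the FP configuration.
# (Illustrative only -- not the actual NEORV32 decoder or GCC/LLVM logic.)

def float_load_insn(has_f: bool, has_zfinx: bool) -> str:
    """Return the compressed instruction used to load a 32-bit float."""
    if has_zfinx:
        # Zfinx: FP operands live in the integer register file,
        # so an ordinary integer load is used -- no c.flw needed.
        return "c.lw"
    if has_f:
        # Classic F extension: dedicated FP register file, so the
        # compressed FP load (part of Zcf) is applicable.
        return "c.flw"
    # No FP at all: soft-float, still an integer load.
    return "c.lw"

print(float_load_insn(has_f=True, has_zfinx=False))   # c.flw
print(float_load_insn(has_f=False, has_zfinx=True))   # c.lw
```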
I had a closer look at the `Zc*` ISA extensions. I think they are quite promising. However, in terms of the NEORV32, I am not fully convinced that it would be a good idea to implement all of them.
- `Zca` - this is what we have if the FPU is disabled (`Zfinx` disabled)
- `Zcf` - this is what we have if the FPU is enabled
- `Zcd` makes no sense as the FPU is single-precision only
- `Zcmp` (list-based push/pop similar to ARM) is quite interesting, but would require a lot of additional hardware. Furthermore, precise exception trapping is complex here as there are several memory loads/stores invoked by a single instruction.
- `Zcmt` (table-based jumps) might be a nice thing to have. But this would have a high latency - so the only gain would be further code size reduction (not performance).
- `Zcb` - I really like this extension because it adds 16-bit variants for common operations (like multiplication). This would be quite easy to implement, I think. So, yeah, maybe this sub-extension might be integrated in the future. :wink:
This is just my opinion. Any thoughts?
`Zcb` looks promising, in that it is relatively easy to implement.... with disciplined declarations of integer types in EM (`uint8`, `int16`, `uint32`, etc) this would mesh quite well.... future CPU implementations that use (say) an internal 8- or 16-bit ALU would also benefit; reducing ALU width obviously saves gates....
as for `Zcmp`, the EM runtime would generally place some common push/pop code fragments (used by LLVM) into the boot ROM.... even with the smallest boot ROM (say, 2K), there is plenty of "free and fast" instruction memory....
i'm not entirely sure what motivates `Zcmt`.... but honestly, reducing code size without a performance gain doesn't seem worth it anyway....
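For reference, here is a tiny Python model of what a `Zcmt` table-based jump does (hypothetical and simplified; the real encoding and the `jvt` CSR are more involved). It shows where the extra latency comes from: the jump target must first be loaded from the jump vector table in memory, adding a memory access before the jump itself.

```python
# Simplified model of a Zcmt table-based jump (cm.jt).
# Hypothetical sketch -- the real encoding and jvt CSR are more involved.

def cm_jt(memory: dict, jvt_base: int, index: int) -> int:
    """Return the jump target for jump-vector-table entry `index`.

    Note the extra memory load: the target address must be fetched
    from the table before the jump can be taken. This is the source
    of the added latency -- the payoff is that each call site shrinks
    to a single 16-bit instruction.
    """
    entry_addr = jvt_base + index * 4   # 32-bit entries on RV32
    return memory[entry_addr]           # additional load -> extra latency

# Example: a 3-entry jump table at address 0x1000
mem = {0x1000: 0x80000100, 0x1004: 0x80000200, 0x1008: 0x80000300}
print(hex(cm_jt(mem, 0x1000, 2)))  # 0x80000300
```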
bottom line -- `Zcb` would be first, when we get around to it....
> bottom line -- `Zcb` would be first, when we get around to it....
I agree! But before we start implementing that we should wait for GCC support. Unfortunately, upcoming GCC 13(.1) does not include `Zcb` (https://gcc.gnu.org/gcc-13/changes.html).
i'm finding that LLVM is much more current with risc-v extensions.... looks like they have lots of `Zb*` support -- as requested in #640
since EM supports both compilers, comparative benchmarks are trivial....
Interesting results! Thanks for sharing!
35% would be quite amazing, but I'm not sure what the "cost" of that might be (additional hardware resources, impact on critical path, etc.). `Zcmp` adds push and pop operations that would require modifying the CPU's pipeline (as there are several memory accesses triggered by a single instruction).
But the NEORV32's execution stage is a multi-cycle architecture... so maybe the additional hardware overhead would be quite small... I think I'll need to have a closer look at this again.
Moreover, Zcb has become mandatory for the RVA23 profile.
Oh, I did not expect that. However, RVA is the application-class profile (MMU, 64-bit, ...), which is out of scope of this project right now 🙈
I had another look at the `Zcb` specs. Basically, it just adds 11 new compressed instructions. Adding the memory operations should be quite easy and I think the performance benefit might be noticeable. Adding the remaining instructions (bit-manip, multiplication, inversion) is a little bit more complex but still doable.
Anybody volunteering to do a PR? 😅
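For context, here is a sketch of the eleven RV32 `Zcb` instructions and the 32-bit instructions they expand to (my reading of the spec; double-check against the ratified document before relying on it). Each one decompresses 1:1 into a single existing instruction, which is why the implementation effort is modest.

```python
# RV32 Zcb instructions and their 32-bit expansions (1:1 mapping).
# Sketch based on my reading of the Zc* spec -- verify against the
# ratified document before implementing.
ZCB_EXPANSION = {
    "c.lbu":    "lbu rd', uimm(rs1')",
    "c.lhu":    "lhu rd', uimm(rs1')",
    "c.lh":     "lh rd', uimm(rs1')",
    "c.sb":     "sb rs2', uimm(rs1')",
    "c.sh":     "sh rs2', uimm(rs1')",
    "c.zext.b": "andi rd', rd', 0xff",
    "c.sext.b": "sext.b rd', rd'",    # requires Zbb
    "c.zext.h": "zext.h rd', rd'",    # requires Zbb
    "c.sext.h": "sext.h rd', rd'",    # requires Zbb
    "c.not":    "xori rd', rd', -1",
    "c.mul":    "mul rd', rd', rs2'", # requires M (or Zmmul)
}

# Eleven instructions on RV32, each a simple alias for one existing
# 32-bit instruction -- no new datapath operations are needed.
print(len(ZCB_EXPANSION))  # 11
```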
btw, I tested LLVM 18 with/without Zcmp:
The `Zcmp` push/pop instructions are quite powerful as they can "compress" up to 13 loads/stores and an addition into a single 16-bit instruction! They might even increase performance a little bit as there will be less traffic on the CPU's instruction fetch interface.
However, the big problem with these two instructions is that they do not decompress into a single 32-bit counterpart. Instead, they decompress into several different instructions, which would require a lot of hardware overhead. So I think that the "costs" clearly exceed the benefits here.
What do you think? 🤔
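To make that hardware cost concrete, here is a rough Python model of what a `cm.push` has to expand into (hypothetical and simplified; the actual Zcmp register lists, offsets, and stack-alignment rules are more detailed). One 16-bit instruction becomes a variable-length sequence of stores plus a stack-pointer adjustment, which is exactly what a simple 1:1 decompressor cannot produce.

```python
# Rough model of cm.push {ra, s0-sN}, -stack_adj expansion.
# Hypothetical sketch -- real Zcmp register lists, offsets and
# alignment rules are more detailed (see the ratified spec).

def expand_cm_push(num_sregs: int, stack_adj: int) -> list:
    """Expand a cm.push saving ra plus s0..s(num_sregs-1)."""
    regs = ["ra"] + [f"s{i}" for i in range(num_sregs)]
    seq = []
    offset = -4
    for reg in reversed(regs):
        seq.append(f"sw {reg}, {offset}(sp)")  # one store per saved register
        offset -= 4
    seq.append(f"addi sp, sp, -{stack_adj}")   # single stack adjustment
    return seq

# Worst case: ra + s0..s11 -> 13 stores + 1 addi from ONE 16-bit instruction
seq = expand_cm_push(num_sregs=12, stack_adj=64)
print(len(seq))  # 14
```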
The main advantages, from my biased point of view, are reducing the load on the instruction fetch channel, less cache pollution, and a positive impact on interrupt handler latency. However, the most valuable aspect is the reduction of code size to at least the level of the Cortex-M0.
I think technical difficulties are unavoidable, and it's hard to objectively evaluate their value until they come into play :). If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I want to hope that the overhead won't be so huge and won't ruin the whole idea.
> The main advantages from my biased point of view are reducing the load on the instruction fetch channel
That's true! In its best case, this instruction saves up to 27 further 16-bit words from being fetched.
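That figure can be sanity-checked with a quick calculation, assuming the worst-case `cm.pop`: 13 loads plus one stack adjustment, all replaced by a single 16-bit instruction.

```python
# Sanity check of the "27 halfwords saved" figure for a worst-case cm.pop:
# 13 loads + 1 addi, each a 32-bit (2-halfword) instruction, replaced
# by a single 16-bit (1-halfword) instruction.
loads = 13
addi = 1
replaced_halfwords = (loads + addi) * 2  # 28 halfwords of fetch traffic
cm_pop_halfwords = 1
saved = replaced_halfwords - cm_pop_halfwords
print(saved)  # 27
```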
> less cache pollution
Also true. However, embedded single-core systems might not need any kind of caches if you use fast on-chip memory.
> and it should have a positive impact on interrupt handler latency
I'm not sure about this. The execution time itself would be identical. However, due to reduced cache pollution / bus congestion there might be a relevant speedup.
> However, the most valuable aspect is the reduction of byte-code size to at least the level of Cortex-M0.
Maybe, but technically such a complex instruction isn't "RISC" anymore, right? 😅
> If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I want to hope that the overhead won't be so huge and won't ruin the whole idea.
I think there are several benchmark examples provided by the people who invented these extended compressed instructions. The benefit (looking purely at code size and performance) is quite impressive!