rocket-chip
rocket-chip copied to clipboard
Zk(Zbk, Zkn, Zks)/Zb: Scalar Cryptography/Bitmanip Extension
Related PR: #2906
Type of change: other enhancement
Impact: API addition (no impact on existing code)
Development Phase: implementation (needs cleanup)
Note This has borrowed a lot of code (mainly decode logic) from #2906. The decode logic has been rebased on the master due to the merging of H extension.
Some of the logics are also inspired by #2906, thanks!
While #2906 has different implementations for rv32 and rv64, this PR merges them together.
As for Zkr, as #2906 is just a stub, I think a Chisel stub is fairly easy to implement. In the meantime, the CSR logics should be PRed by the owner of #2906 instead of me. Maybe users can just cherry-pick that commit and implement their own Zkr?
There are some notes on Zkt (like ROM optimization), but it is not verified. As this is a single-cycle implementation, Zkt ought to be satisfied (not verified though).
A lot of helpers are temply embedded here, and should be cleaned up. barrel.rightRotate is borrowed from here
This has passed riscv-arch-test on rv64i_m/K and rv32i_m/K when verilated to an emulator
Cc @sequencer @phthinh @ben-marshall
Also thanks @liweiwei90 for his help on riscv-arch-test
TODO
- Clean up of the code on mergeing common datapaths and bad handling of some logics (like multiple levels of Mux of io.out)
- Clean up of some helpers (e.g. move them into zk/utils.scala) / get helpers upstreamed to Chisel
- Add CI tests on Zk for rocket-chip (How?)
- Test against FPGA
Discussion
- Should we merge zbk into ALU as they share some logics (like
~io.rs2and logical operation gates for andn/orn/xnor) - Should we embed SBox as ROM here or just write real-time computing logics, as #2906 and Xiangshan have done
Release Notes
This pull request is to add the hardware implementation to support the Zbk, Zkn, Zks of the cryptographic extensions Zk on the Rocket-chip. This supports the Scalar Instructions specified in the v1.0.1 version. The implementation can work with 32-bit and 64-bit Architectures. The plug-in supports can be enabled by using the WithZBK, WithZKN, WithZKS configuration options.
Progress report:
- Implemented Zb in ALU
- Moved part of Zkb into ALU
- Refactored ALU
Note that this definitely can not compile now, and wont work even with syntax fixed as there is no decode nor control signal now.
Next steps:
- Implement the decode logic, micro ops then control signals
- Refactor ZK part
- synth and PPA report on the single cycle implementation
Review in RC working group suggested putting the refactored ALU into a separate file (maybe ABLU.scala?) instead of directly changing it. and use an option (e.g. withZB or withExperimentalALU) to switch between old and new implementations. Only if the refactored one turns out to be better after extensive test could it replace the old ALU.
Now this passes the Zk arch-test and the Zb arch-test, it could be used now, but clean-up is needed. More specifically, options should be added so that users can choose whether to use the new ABLU.
Among the new commits there are cherry-picked commits from #2956 (otherwise I cannot write decode logics). As I understand, that PR should be reworked after the PR on riscv-opcodes gets merged.
Hi, we recently tested this commit using our hardware fuzzer, the arithmetic function is consistent with the spike implementation, nice work! We detected some mismatches in some corner cases:
- Rocket has a custom instruction extension, the mstatus XS bit is always enabled. Our co-simulate environment was forced to disable the original custom extension and found that the K extension do not enable or check this bit. https://github.com/chipsalliance/rocket-chip/blob/045d03c54bce7636401b1b4b17d6a3677356dfe0/src/main/scala/rocket/CSR.scala#L598-L603
- The current design always enables some B extension. When disable zbb extension, spike throws an exception on rev8 while rocket executes it as a normal instruction.
spike config: core 1, isa: rv32gc_xdummy MSU vlen:128,elen:64
...
[rocket] 3 0x80000182 (0x698dd493) x 9 0x1a3e7d6a
[spike] core 0: 0x80000182 (0x698dd493) rev8 s1, s11
[spike] core 0: exception trap_illegal_instruction, epc 0x80000182
[spike] core 0: tval 0x698dd493
[spike] core 0: 0x80000004 (0x34202f73) csrr t5, mcause
[error] PC SIM 0000000080000004 , DUT 0000000080000182
[error] INSN SIM 34202f73 , DUT 698dd493
[CJ] Commit Failed
- As described in the spec, aes64ks1i should throw an illegal instruction when rcon is bigger than 0xA. Rocket did not throw an illegal instruction exception as spike did.
spike config: core 1, isa: rv64gc_xdummy_zk_zkn_zks MSU vlen:128,elen:64
...
[rocket] 0 0x0080000178 (0x310d1413) x 8 0x1616161716161617
[spike] core 0: 0x0000000080000178 (0x310d1413) aes64ks1i s0, s10, 0x0
[rocket] 0 0x008000017c (0x31ff9493) x 9 0x6330636363306363
[spike] core 0: 0x000000008000017c (0x31ff9493) aes64ks1i s1, t6, 0xf
[spike] core 0: exception trap_illegal_instruction, epc 0x000000008000017c
[spike] core 0: tval 0x0000000000000000
[spike] core 0: 0x0000000080000004 (0x34202f73) csrr t5, mcause
[error] PC SIM 0000000080000004 , DUT 000000008000017c
[error] INSN SIM 34202f73 , DUT 31ff9493
[CJ] Commit Failed
Thanks for your attention on this PR!
- The current design always enables some B extension. When disable zbb extension, spike throws an exception on rev8 while rocket executes it as a normal instruction.
Yes that's what "but clean-up is needed" in my last comment meant. You can see that I have manually enabled both Zb/Zk extensions for this commit. I will add a new commit for new options soon. Actually, if these hard-encoded "true" are all turned to "false" then instructions like rev8 will cause illegal instruction exception.
- As described in the spec, aes64ks1i should throw an illegal instruction when rcon is bigger than 0xA. Rocket did not throw an illegal instruction exception as spike did.
Thanks for detecting this! Yes I have noticed this when reading the spec but as I was following the IO interface of #2906, I did not implement this mechanism. I will fix this issue later.
- found that the K extension do not enable or check this bit.
By the privileged spec (20211203),
In systems without additional user extensions requiring new state, the XS field is read-only zero. Every additional extension with state provides a CSR field that encodes the equivalent of the XS.
I think Zk/Zb does not have states. Even though Zkr added a seed CSR, I do not think that is a state. Also this PR has not implemented Zkr. I am not so sure about this, correct me if there is any misunderstanding.
Sorry for the confusion. Since misa is writable, for the extended instruction set we need to detect whether the corresponding extension is turned on during execution, which means that there are two places to be checked, one is the misa register and the other is XS bit. The current implementation seems unable to turn off these extended.
Since misa is writable, for the extended instruction set we need to detect whether the corresponding extension is turned on during execution
AFAIK, there is no misa bits for Zb/Zk extension, thus the detection and enabling/disabling of Zb/Zk can not be done via misa CSR. Currently I do not know how to detect whether the platform supports Zb/Zk in software, please tell me if anyone knows that.
and the other is XS bit.
XS bits are also not related to Zb/Zk as Zb/Zk has no CPU state that needs to be saved by supervisor; Zb/Zk also does not define a CSR field that "encodes the qeuivalent of the XS states" as the spec said.
I've tried to synthesis the PR using vendor tools against nangate45. According to the synthesis result I've made changes to SBox implementation.
Area
At commit 18927be with RV64, the area is
| Hierarchical cell | Absolute Total |
|---|---|
| Rocket | 80570 |
| ablu | 4306 |
| div | 8015 |
| zbk | 7612 |
| zkn | 19378 |
| zkn/aes_so_m | 1899 |
| zkn/aes_so_m/dec | 950 |
| zkn/aes_so_m/enc | 923 |
| zks | 1471 |
| zks/so_m | 934 |
Note that for RV64, there are 16 AES SBoxes (8 Enc, 8 Dec) and one SM4 Sbox, each takes about 950 area. This is quite unbearable as zkn takes around 1/4 of the total area. As a comparison, div only takes 8000.
Then I added 8240ffd which is inspired from riscv-crypto/rtl and the area result is
| Hierarchical cell | Absolute Total |
|---|---|
| Rocket | 67377 |
| ablu | 4309 |
| div | 8015 |
| zbk | 7617 |
| zkn | 6804 |
| zkn/aes_so_m | 323 |
| zks | 709 |
After this commit zkn only takes nearly the same area as div, which is acceptable for RV64 IMHO. Note that for RV32 there is only three SBox (AES enc/dec, SM4) and their AESMid can be reused (marked in TODO but not done yet), which is also acceptable.
Originally RC working group had suggested reducing the number of Sboxes and reuse them across multiple cycles. I think the current implementation is acceptable for extremely high crypto performance when using RV64 (RV32 does not have this single/multi-cycle trade-off). I can work on reducing the number by two (half the area), just similar to the SCIEPipelined, but I want to know if it is worth the time. I agree that providing a parameter (do one encryption in 1, 2, 4, 8 cycle) is more flexible for end users, but this complicates the design and needs a lot of effort.
As a reference, this is the area report of b36060ebe
| Hierarchical cell | Absolute Total |
|---|---|
| Rocket | 49154 |
| alu | 1721 |
| div | 8015 |
ABLU doubles the area of the original ALU. Note that ABLU does not contain clmul and xperm (in zbk above) and the result showed that they are quite costly (two operations along are bigger than a whole ABLU).
Timing
The timing part is a tricky part. In the original Rocket, the critical path actually lies in div part. But in this PR the ABLU becomes the critical path. The slack for div is at most 2.6 when synthesis to 100MHz but the slack for ablu is at 1.25~1.8
I doubt that this bad timing has to do with the pla in ABLU. Actually we have done two decoding here: First IDecode from instruction to FN; then FN to ctrl signals. This should all be done in Decode stage. I will experiment the timing result when that pla is moved into Decode stage.
However, decoding all the ctrl signals in Decode stage will increase the footprint of registers (e.g. ex_ctrl). Since the original FN is only 4 bits but the ctrl signals for ABLU along is about 40 bits, it is worth arguing whether we should do it.
Another thing to note is that the timing of the decode stage is not so good either.
Surprisingly, ZKN/ZKS/ZBK does not have timing issue, at least not as bad as div. It is worth noting that we originally thought clmul/xperm need to be pipelined as it is deep (64 xor) but the timing report does not have any complaint about them. Based on this result I will merge ZBK into ABLU; ZBK was originally separated for a multi-cycle implementation.
Cleanup
I've added some configs withABLU, withBitManip, withBitManipCrypto, withCryptoNIST and withCryptoSM, thus @Phantom1003 can disable an extension now.
Note that the misa string now will not contain "B".
There are still further cleanups like merging the decode table, upstreaming some utils and fixing issues from Phantom1003.
I think the current implementation is acceptable for extremely high crypto performance when using RV64
I've made a benchmark for RV64 at https://gist.github.com/ZenithalHourlyRate/7b5175734f87acb73d0bbc53391d7140 using this PR against my software AES implementation in OpenSSL. It turns out that the result is considerable but not too high (up to 9.5X for aes-256-ecb). As the assembly snippet has shown, aes64esm are often executed consecutively, thus if a multi-cycle version was implemented, we would need to stall the pipeline most of the time. I think the single cycle implementation is acceptable for now (both area and latency). Multi-cycle version shall be done in future PRs if there is indeed this usage.
I'll work on fixing the issue discovered by Phantom1003 and the timing issue (pla decode) and cleaning up, further architectural changes on crypto would not probably be made in this PR.
With the Add aes64ks1i exception commit the issue @Phantom1003 mentioned should be fixed.
I have run a sample program using spike pk test and emulator pk test, the result is consistent
# spike
core 0: 0x0000000000011480 (0x31069713) aes64ks1i a4, a3, 0
core 0: 0x0000000000011484 (0x7e670333) aes64ks2 t1, a4, t1
core 0: 0x0000000000011488 (0x7e7303b3) aes64ks2 t2, t1, t2
core 0: 0x000000000001148c (0x02060613) addi a2, a2, 32
core 0: 0x0000000000011490 (0x00663023) sd t1, 0(a2)
core 0: 0x0000000000011494 (0x00763423) sd t2, 8(a2)
core 0: 0x0000000000011498 (0x31c39713) aes64ks1i a4, t2, 12
core 0: exception trap_illegal_instruction, epc 0x0000000000011498
core 0: tval 0x0000000000000000
core 0: 0x0000000080000004 (0x34011173) csrrw sp, mscratch, sp
core 0: 0x0000000080000008 (0x1a010863) beqz sp, pc + 432
# emulator
C0: 1753565 [1] pc=[0000000000011480] W[r14=6363636263636362][1] R[r13=0000000000000000] R[r16=0000000000000000] inst=[31069713] aes64ks1i a4, a3, 0
C0: 1753566 [1] pc=[0000000000011484] W[r 6=6363636263636362][1] R[r14=6363636263636362] R[r 6=0000000000000000] inst=[7e670333] aes64ks2 t1, a4, t1
C0: 1753567 [1] pc=[0000000000011488] W[r 7=6363636263636362][1] R[r 6=6363636263636362] R[r 7=0000000000000000] inst=[7e7303b3] aes64ks2 t2, t1, t2
C0: 1753568 [1] pc=[000000000001148c] W[r12=0000003ffffff928][1] R[r12=0000003ffffff908] R[r 0=0000000000000000] inst=[02060613] addi a2, a2, 32
C0: 1753569 [1] pc=[0000000000011490] W[r 0=0000000000000000][0] R[r12=0000003ffffff928] R[r 6=6363636263636362] inst=[00663023] sd t1, 0(a2)
C0: 1753570 [1] pc=[0000000000011494] W[r 0=0000000000000000][0] R[r12=0000003ffffff928] R[r 7=6363636263636362] inst=[00763423] sd t2, 8(a2)
C0: 1753571 [0] pc=[0000000000011498] W[r14=0000000000000000][0] R[r 7=0000000031c39713] R[r28=0000000000000006] inst=[31c39713] aes64ks1i a4, t2, 12
C0: 1753598 [1] pc=[0000000080000004] W[r 2=0000000080011ec0][1] R[r 2=0000003ffffff8f0] R[r 0=0000000000000000] inst=[34011173] csrrw sp, mscratch, sp
C0: 1753601 [1] pc=[0000000080000008] W[r 0=0000000000000000][0] R[r 2=0000000080011ec0] R[r 0=0000000000000000] inst=[1a010863] beqz sp, pc + 432
Commit f2cab56 "Move pla into decode" improves the timing for ABLU. When synthesized in 100MHz (10ns), the worst slack is now 1.65 instead of 1.25.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: phthinh / name: Thinh Pham (ee289d44ed21b193ffd53968eef96b30df12195f, 386e5e99b6009ca4725c09f891e5c8884a9e2769, a3ba274e200619afedf8390b5f29adbb62d8fcdb, 35f02e11d504e7c5701689ee976814d1b54bfb62, c3278636d0c801970274e4cc295d9e27bf93ba7f, 36350c90a2148183b803346f09d7581e2b2df913)
- :white_check_mark: login: ZenithalHourlyRate / name: Zenithal (5932acb74c3fd220cfe2599af786131db9ae565a, 5f4fcbb09258f45007eecb609d1c7d735e8fd833, da9d6b82424612ad07c9c67715657949ace5e686, 428f8080fc606e4aed0ad7244cbb4b1b4b892a88, 58dc1b536802c86477b5a250fb3be93fc3ee555f, a786f231966b4a1fc1d0cee3893fcf9a1077083c, 2eefbe0cbd18edd8a8bde2d6bb96a95b72b9ab9b, c5e0bd1bcf373fe99579975661d673d0a4160913, af97f3a18028b351cc65361fa64a063544fe3150, 0b46e1db18e6473ae6171fe8e6f5eb38343f2892, 380bb42a4c4702211024759b93678cf619779c2f, 009f0ed56a434071e07d9a76482e635bf5266622, f431ffd243394c8a331bfb3829d74ad47f18bbf8, 7ae146a64831b9102306b09b63a793c2d9f514d9, e2c81850a9a1d131883b7dae6a11de05eef8c84b, 2faae3f74685e8b8ab70b5d8d9ac9a6db21f0a4e, 265d4ee536b6c72a89498edd68ba5e2ae062c42b, 0ce3f74049b89d39067974096a7eb1ccc4c4d52b, d26e56199ed6ca8ad8e14e8364d7a40e396b1805, 76db33a7922b009464137aa6ae5bd15f43f6c142, d4acc70abfa5c0e73b218c0a3f0154b0689cbaf4, 7631b8ae2682f9ab4275d2d5f11c703a4afa00ae, 8240daf3a34b52fe59afa1a2e1c21b222b0df86d, 38728ef3e57ee226caf444d95fd745935b639c4d, 65bc35c19536e83f70784861748e3a2f8db0004f, d25c4b987b9827a70f1a83f1a05c1920c395f5a4, dce44b8e96722ec69d740c4748587840cd6d873a, 0a8d59fb595746c5bdeaeba3a2efc51f853044d1, 784b0e63a3599de35d37546fd570e031b8a5a91e, ab51f3fdfd4946c47133c622927a82c34dce5b35, 5fcd554ffdbfcbc940d91d7ef913a8daa813f15b, 3c294806a29f576f5ecae57812d9d77ff8bdad66, 960cf3e040a8072313fdde4f6b68dab809d8b7c2, 949498029155eb6e27e20584bf7d4ee26f1299a4, cd52e3ec42bd894819f877000d2023fd966595a9, da3972017418e0ba7b27bd13f4a86cf40cee9f09, 74f17e7affa6d7334ea82c585bee6474c2da71a1, de9efc49a145d93e3f48301a6b64123e835b307a, 181c1ade62a2ac0999c1c46ec880a76416da263c, 978118b2df56713bbabcde6babf166720e1a4309, fd7b7be2930a2ab976c98f88934cfb767cfebf82, 81db1537fcf9bf9092e44b165e72178dac61b548, 906cb6c021cb9929fdebdb77b07b9777773d9f27, 06688eb7ece207c379f73ab3da479675fc56be27, f39d059b6cb320e10aea13d5883444717fb5d879, c770dd972cbca2540ba1d55cba7b797081e873ea, 374de8f3dc2f143abc1e121df98f3b9a7a22e224, 8bab78ee34689db6e9fcbb06f5f422ca69ef606f, c0f226025594c513170482fdd29d7ffc5e6ca155, cc504cf4b9755a07b13b9919ce9b76d26802c93f, 773a11986cb96543b5f44b172792c594eea03fd7, ad3edffff97ad5752345cf8f2aae4f8ae441adb2, 18927be0a4f66e929e5dc2400a941b7903158906, 8240ffd06add867661e496e966091876198d9375, fcc711f048ef6f56ad6d02e2c56a67c26d401983, 8541647be7fc6229d2a8b798da7c666710b7bcc7, 37f089a5e4777c0c01f5da2454a454ae61fc1f86, 4f8114374d8824dfdec03f576a8cd68bebce4e56, c4c641a8631f7f0c915aa4a42827cb35fc898f5e, 1916f3aacc98e64e2d5cc5cc33535a37bbfd7aff, 1a76f5deeb1f51585b4d53e1978b915e4f19eeed, f2cab567824215835e639a787671af3510f25e2b, 076d041e02d08cc78500cdea004276f57f1eca87)
- :white_check_mark: login: cyyself / name: Yangyu Chen (84574f8a9aa28bd0493d9ed55f150fd3cb99d0aa)
Merged master, redesigned control signals and did some cleanup.
The IDecode: use DW_XPR for branch insts needs special attention as it is a regression introduced by the new decoder. See more in its commit message.
We have many things to test, but the test framework is not ready (#2991), like both RV32 and RV64, enabling both Zk/Zb, enabling only Zk and enabling only Zb. For aes64ks1i, the illegal instruction exception codepath should also be covered in CI, it seems that arch-test has not tested that.
I also added some if condition to prevent instantiating some datapath when Zb is enabled but Zbk is not.
I mark this PR as ready for review now. Some utils that could be upstreamed to Chisel, if merged, still need a longer time to hit rocket chip so I assume adding them here and removing them in future PRs is appropriate.
FYI, this PR is part of my undergraduate thesis (in Chinese) and I have made it public in my blog. Any one interested could find it here and there are two presentations (in English) on that thesis. I thought that presentation (and the figures in it) could be a documentation.
This has passed riscv-arch-test on rv64i_m/K and rv32i_m/K when verilated to an emulator
Finally added these tests as CI and these tests passed locally; cherry-picked many commits from #3201 so should get merged after that PR (when the OOM issue gets fixed).
Other cores depend on this stuff, and I need to think about making sure migration is not too painful (and does not cause bugs).
Agree! But notice eventually we should split rocket into a single package that only publish RocketCore as public api. For other cores depending on internal modules inside core will be deprecated.
I will clone this branch myself to check some things.
Should squash&merge for this PR because the history is too long..
Yes we should squash+merge. This PR looks fine for me now.
Once CI is done I will squash+merge
Once CI is done I will squash+merge
CI passed.