rocket-chip icon indicating copy to clipboard operation
rocket-chip copied to clipboard

Zk(Zbk, Zkn, Zks)/Zb: Scalar Cryptography/Bitmanip Extension

Open ZenithalHourlyRate opened this issue 3 years ago • 14 comments

Related PR: #2906

Type of change: other enhancement

Impact: API addition (no impact on existing code)

Development Phase: implementation (needs cleanup)

Note This has borrowed a lot of code (mainly decode logic) from #2906. The decode logic has been rebased on the master due to the merging of H extension.

Some of the logics are also inspired by #2906, thanks!

While #2906 has different implementations for rv32 and rv64, this PR merges them together.

As for Zkr, as #2906 is just a stub, I think a Chisel stub is fairly easy to implement. In the meantime, the CSR logics should be PRed by the owner of #2906 instead of me. Maybe users can just cherry-pick that commit and implement their own Zkr?

There are some notes on Zkt (like ROM optimization), but it is not verified. As this is a single-cycle implementation, Zkt ought to be satisfied (not verified though).

A lot of helpers are temply embedded here, and should be cleaned up. barrel.rightRotate is borrowed from here

This has passed riscv-arch-test on rv64i_m/K and rv32i_m/K when verilated to an emulator

Cc @sequencer @phthinh @ben-marshall

Also thanks @liweiwei90 for his help on riscv-arch-test

TODO

  1. Clean up of the code on mergeing common datapaths and bad handling of some logics (like multiple levels of Mux of io.out)
  2. Clean up of some helpers (e.g. move them into zk/utils.scala) / get helpers upstreamed to Chisel
  3. Add CI tests on Zk for rocket-chip (How?)
  4. Test against FPGA

Discussion

  1. Should we merge zbk into ALU as they share some logics (like ~io.rs2 and logical operation gates for andn/orn/xnor)
  2. Should we embed SBox as ROM here or just write real-time computing logics, as #2906 and Xiangshan have done

Release Notes

This pull request is to add the hardware implementation to support the Zbk, Zkn, Zks of the cryptographic extensions Zk on the Rocket-chip. This supports the Scalar Instructions specified in the v1.0.1 version. The implementation can work with 32-bit and 64-bit Architectures. The plug-in supports can be enabled by using the WithZBK, WithZKN, WithZKS configuration options.

ZenithalHourlyRate avatar Mar 19 '22 15:03 ZenithalHourlyRate

Progress report:

  1. Implemented Zb in ALU
  2. Moved part of Zkb into ALU
  3. Refactored ALU

Note that this definitely can not compile now, and wont work even with syntax fixed as there is no decode nor control signal now.

Next steps:

  1. Implement the decode logic, micro ops then control signals
  2. Refactor ZK part
  3. synth and PPA report on the single cycle implementation

ZenithalHourlyRate avatar Apr 04 '22 17:04 ZenithalHourlyRate

Review in RC working group suggested putting the refactored ALU into a separate file (maybe ABLU.scala?) instead of directly changing it. and use an option (e.g. withZB or withExperimentalALU) to switch between old and new implementations. Only if the refactored one turns out to be better after extensive test could it replace the old ALU.

ZenithalHourlyRate avatar Apr 06 '22 20:04 ZenithalHourlyRate

Now this passes the Zk arch-test and the Zb arch-test, it could be used now, but clean-up is needed. More specifically, options should be added so that users can choose whether to use the new ABLU.

Among the new commits there are cherry-picked commits from #2956 (otherwise I cannot write decode logics). As I understand, that PR should be reworked after the PR on riscv-opcodes gets merged.

ZenithalHourlyRate avatar Apr 18 '22 18:04 ZenithalHourlyRate

Hi, we recently tested this commit using our hardware fuzzer, the arithmetic function is consistent with the spike implementation, nice work! We detected some mismatches in some corner cases:

  • Rocket has a custom instruction extension, the mstatus XS bit is always enabled. Our co-simulate environment was forced to disable the original custom extension and found that the K extension do not enable or check this bit. https://github.com/chipsalliance/rocket-chip/blob/045d03c54bce7636401b1b4b17d6a3677356dfe0/src/main/scala/rocket/CSR.scala#L598-L603
  • The current design always enables some B extension. When disable zbb extension, spike throws an exception on rev8 while rocket executes it as a normal instruction.
spike config: core 1, isa: rv32gc_xdummy MSU vlen:128,elen:64
...
[rocket] 3 0x80000182 (0x698dd493) x 9 0x1a3e7d6a
[spike] core   0: 0x80000182 (0x698dd493) rev8    s1, s11
[spike] core   0: exception trap_illegal_instruction, epc 0x80000182
[spike] core   0:           tval 0x698dd493
[spike] core   0: 0x80000004 (0x34202f73) csrr    t5, mcause
[error] PC SIM 0000000080000004 , DUT 0000000080000182
[error] INSN SIM 34202f73 , DUT 698dd493
[CJ] Commit Failed

kext-0.zip

  • As described in the spec, aes64ks1i should throw an illegal instruction when rcon is bigger than 0xA. Rocket did not throw an illegal instruction exception as spike did.
spike config: core 1, isa: rv64gc_xdummy_zk_zkn_zks MSU vlen:128,elen:64
...
[rocket] 0 0x0080000178 (0x310d1413) x 8 0x1616161716161617
[spike] core   0: 0x0000000080000178 (0x310d1413) aes64ks1i s0, s10, 0x0
[rocket] 0 0x008000017c (0x31ff9493) x 9 0x6330636363306363
[spike] core   0: 0x000000008000017c (0x31ff9493) aes64ks1i s1, t6, 0xf
[spike] core   0: exception trap_illegal_instruction, epc 0x000000008000017c
[spike] core   0:           tval 0x0000000000000000
[spike] core   0: 0x0000000080000004 (0x34202f73) csrr    t5, mcause
[error] PC SIM 0000000080000004 , DUT 000000008000017c
[error] INSN SIM 34202f73 , DUT 31ff9493
[CJ] Commit Failed

kext-1.zip

Phantom1003 avatar Apr 22 '22 18:04 Phantom1003

Thanks for your attention on this PR!

  • The current design always enables some B extension. When disable zbb extension, spike throws an exception on rev8 while rocket executes it as a normal instruction.

Yes that's what "but clean-up is needed" in my last comment meant. You can see that I have manually enabled both Zb/Zk extensions for this commit. I will add a new commit for new options soon. Actually, if these hard-encoded "true" are all turned to "false" then instructions like rev8 will cause illegal instruction exception.

  • As described in the spec, aes64ks1i should throw an illegal instruction when rcon is bigger than 0xA. Rocket did not throw an illegal instruction exception as spike did.

Thanks for detecting this! Yes I have noticed this when reading the spec but as I was following the IO interface of #2906, I did not implement this mechanism. I will fix this issue later.

  • found that the K extension do not enable or check this bit.

By the privileged spec (20211203),

In systems without additional user extensions requiring new state, the XS field is read-only zero. Every additional extension with state provides a CSR field that encodes the equivalent of the XS.

I think Zk/Zb does not have states. Even though Zkr added a seed CSR, I do not think that is a state. Also this PR has not implemented Zkr. I am not so sure about this, correct me if there is any misunderstanding.

ZenithalHourlyRate avatar Apr 23 '22 00:04 ZenithalHourlyRate

Sorry for the confusion. Since misa is writable, for the extended instruction set we need to detect whether the corresponding extension is turned on during execution, which means that there are two places to be checked, one is the misa register and the other is XS bit. The current implementation seems unable to turn off these extended.

Phantom1003 avatar Apr 23 '22 04:04 Phantom1003

Since misa is writable, for the extended instruction set we need to detect whether the corresponding extension is turned on during execution

AFAIK, there is no misa bits for Zb/Zk extension, thus the detection and enabling/disabling of Zb/Zk can not be done via misa CSR. Currently I do not know how to detect whether the platform supports Zb/Zk in software, please tell me if anyone knows that.

and the other is XS bit.

XS bits are also not related to Zb/Zk as Zb/Zk has no CPU state that needs to be saved by supervisor; Zb/Zk also does not define a CSR field that "encodes the qeuivalent of the XS states" as the spec said.

ZenithalHourlyRate avatar Apr 23 '22 07:04 ZenithalHourlyRate

I've tried to synthesis the PR using vendor tools against nangate45. According to the synthesis result I've made changes to SBox implementation.

Area

At commit 18927be with RV64, the area is

Hierarchical cell Absolute Total
Rocket 80570
ablu 4306
div 8015
zbk 7612
zkn 19378
zkn/aes_so_m 1899
zkn/aes_so_m/dec 950
zkn/aes_so_m/enc 923
zks 1471
zks/so_m 934

Note that for RV64, there are 16 AES SBoxes (8 Enc, 8 Dec) and one SM4 Sbox, each takes about 950 area. This is quite unbearable as zkn takes around 1/4 of the total area. As a comparison, div only takes 8000.

Then I added 8240ffd which is inspired from riscv-crypto/rtl and the area result is

Hierarchical cell Absolute Total
Rocket 67377
ablu 4309
div 8015
zbk 7617
zkn 6804
zkn/aes_so_m 323
zks 709

After this commit zkn only takes nearly the same area as div, which is acceptable for RV64 IMHO. Note that for RV32 there is only three SBox (AES enc/dec, SM4) and their AESMid can be reused (marked in TODO but not done yet), which is also acceptable.

Originally RC working group had suggested reducing the number of Sboxes and reuse them across multiple cycles. I think the current implementation is acceptable for extremely high crypto performance when using RV64 (RV32 does not have this single/multi-cycle trade-off). I can work on reducing the number by two (half the area), just similar to the SCIEPipelined, but I want to know if it is worth the time. I agree that providing a parameter (do one encryption in 1, 2, 4, 8 cycle) is more flexible for end users, but this complicates the design and needs a lot of effort.

As a reference, this is the area report of b36060ebe

Hierarchical cell Absolute Total
Rocket 49154
alu 1721
div 8015

ABLU doubles the area of the original ALU. Note that ABLU does not contain clmul and xperm (in zbk above) and the result showed that they are quite costly (two operations along are bigger than a whole ABLU).

Timing

The timing part is a tricky part. In the original Rocket, the critical path actually lies in div part. But in this PR the ABLU becomes the critical path. The slack for div is at most 2.6 when synthesis to 100MHz but the slack for ablu is at 1.25~1.8

I doubt that this bad timing has to do with the pla in ABLU. Actually we have done two decoding here: First IDecode from instruction to FN; then FN to ctrl signals. This should all be done in Decode stage. I will experiment the timing result when that pla is moved into Decode stage.

However, decoding all the ctrl signals in Decode stage will increase the footprint of registers (e.g. ex_ctrl). Since the original FN is only 4 bits but the ctrl signals for ABLU along is about 40 bits, it is worth arguing whether we should do it.

Another thing to note is that the timing of the decode stage is not so good either.

Surprisingly, ZKN/ZKS/ZBK does not have timing issue, at least not as bad as div. It is worth noting that we originally thought clmul/xperm need to be pipelined as it is deep (64 xor) but the timing report does not have any complaint about them. Based on this result I will merge ZBK into ABLU; ZBK was originally separated for a multi-cycle implementation.

Cleanup

I've added some configs withABLU, withBitManip, withBitManipCrypto, withCryptoNIST and withCryptoSM, thus @Phantom1003 can disable an extension now.

Note that the misa string now will not contain "B".

There are still further cleanups like merging the decode table, upstreaming some utils and fixing issues from Phantom1003.

ZenithalHourlyRate avatar Apr 23 '22 21:04 ZenithalHourlyRate

I think the current implementation is acceptable for extremely high crypto performance when using RV64

I've made a benchmark for RV64 at https://gist.github.com/ZenithalHourlyRate/7b5175734f87acb73d0bbc53391d7140 using this PR against my software AES implementation in OpenSSL. It turns out that the result is considerable but not too high (up to 9.5X for aes-256-ecb). As the assembly snippet has shown, aes64esm are often executed consecutively, thus if a multi-cycle version was implemented, we would need to stall the pipeline most of the time. I think the single cycle implementation is acceptable for now (both area and latency). Multi-cycle version shall be done in future PRs if there is indeed this usage.

I'll work on fixing the issue discovered by Phantom1003 and the timing issue (pla decode) and cleaning up, further architectural changes on crypto would not probably be made in this PR.

ZenithalHourlyRate avatar May 01 '22 11:05 ZenithalHourlyRate

With the Add aes64ks1i exception commit the issue @Phantom1003 mentioned should be fixed.

I have run a sample program using spike pk test and emulator pk test, the result is consistent

# spike
core   0: 0x0000000000011480 (0x31069713) aes64ks1i a4, a3, 0
core   0: 0x0000000000011484 (0x7e670333) aes64ks2 t1, a4, t1 
core   0: 0x0000000000011488 (0x7e7303b3) aes64ks2 t2, t1, t2 
core   0: 0x000000000001148c (0x02060613) addi    a2, a2, 32 
core   0: 0x0000000000011490 (0x00663023) sd      t1, 0(a2)
core   0: 0x0000000000011494 (0x00763423) sd      t2, 8(a2)
core   0: 0x0000000000011498 (0x31c39713) aes64ks1i a4, t2, 12 
core   0: exception trap_illegal_instruction, epc 0x0000000000011498
core   0:           tval 0x0000000000000000
core   0: 0x0000000080000004 (0x34011173) csrrw   sp, mscratch, sp
core   0: 0x0000000080000008 (0x1a010863) beqz    sp, pc + 432
# emulator
C0:    1753565 [1] pc=[0000000000011480] W[r14=6363636263636362][1] R[r13=0000000000000000] R[r16=0000000000000000] inst=[31069713] aes64ks1i a4, a3, 0
C0:    1753566 [1] pc=[0000000000011484] W[r 6=6363636263636362][1] R[r14=6363636263636362] R[r 6=0000000000000000] inst=[7e670333] aes64ks2 t1, a4, t1 
C0:    1753567 [1] pc=[0000000000011488] W[r 7=6363636263636362][1] R[r 6=6363636263636362] R[r 7=0000000000000000] inst=[7e7303b3] aes64ks2 t2, t1, t2 
C0:    1753568 [1] pc=[000000000001148c] W[r12=0000003ffffff928][1] R[r12=0000003ffffff908] R[r 0=0000000000000000] inst=[02060613] addi    a2, a2, 32 
C0:    1753569 [1] pc=[0000000000011490] W[r 0=0000000000000000][0] R[r12=0000003ffffff928] R[r 6=6363636263636362] inst=[00663023] sd      t1, 0(a2)
C0:    1753570 [1] pc=[0000000000011494] W[r 0=0000000000000000][0] R[r12=0000003ffffff928] R[r 7=6363636263636362] inst=[00763423] sd      t2, 8(a2)
C0:    1753571 [0] pc=[0000000000011498] W[r14=0000000000000000][0] R[r 7=0000000031c39713] R[r28=0000000000000006] inst=[31c39713] aes64ks1i a4, t2, 12 
C0:    1753598 [1] pc=[0000000080000004] W[r 2=0000000080011ec0][1] R[r 2=0000003ffffff8f0] R[r 0=0000000000000000] inst=[34011173] csrrw   sp, mscratch, sp
C0:    1753601 [1] pc=[0000000080000008] W[r 0=0000000000000000][0] R[r 2=0000000080011ec0] R[r 0=0000000000000000] inst=[1a010863] beqz    sp, pc + 432

ZenithalHourlyRate avatar May 06 '22 12:05 ZenithalHourlyRate

Commit f2cab56 "Move pla into decode" improves the timing for ABLU. When synthesized in 100MHz (10ns), the worst slack is now 1.65 instead of 1.25.

ZenithalHourlyRate avatar May 09 '22 13:05 ZenithalHourlyRate

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: phthinh / name: Thinh Pham (ee289d44ed21b193ffd53968eef96b30df12195f, 386e5e99b6009ca4725c09f891e5c8884a9e2769, a3ba274e200619afedf8390b5f29adbb62d8fcdb, 35f02e11d504e7c5701689ee976814d1b54bfb62, c3278636d0c801970274e4cc295d9e27bf93ba7f, 36350c90a2148183b803346f09d7581e2b2df913)
  • :white_check_mark: login: ZenithalHourlyRate / name: Zenithal (5932acb74c3fd220cfe2599af786131db9ae565a, 5f4fcbb09258f45007eecb609d1c7d735e8fd833, da9d6b82424612ad07c9c67715657949ace5e686, 428f8080fc606e4aed0ad7244cbb4b1b4b892a88, 58dc1b536802c86477b5a250fb3be93fc3ee555f, a786f231966b4a1fc1d0cee3893fcf9a1077083c, 2eefbe0cbd18edd8a8bde2d6bb96a95b72b9ab9b, c5e0bd1bcf373fe99579975661d673d0a4160913, af97f3a18028b351cc65361fa64a063544fe3150, 0b46e1db18e6473ae6171fe8e6f5eb38343f2892, 380bb42a4c4702211024759b93678cf619779c2f, 009f0ed56a434071e07d9a76482e635bf5266622, f431ffd243394c8a331bfb3829d74ad47f18bbf8, 7ae146a64831b9102306b09b63a793c2d9f514d9, e2c81850a9a1d131883b7dae6a11de05eef8c84b, 2faae3f74685e8b8ab70b5d8d9ac9a6db21f0a4e, 265d4ee536b6c72a89498edd68ba5e2ae062c42b, 0ce3f74049b89d39067974096a7eb1ccc4c4d52b, d26e56199ed6ca8ad8e14e8364d7a40e396b1805, 76db33a7922b009464137aa6ae5bd15f43f6c142, d4acc70abfa5c0e73b218c0a3f0154b0689cbaf4, 7631b8ae2682f9ab4275d2d5f11c703a4afa00ae, 8240daf3a34b52fe59afa1a2e1c21b222b0df86d, 38728ef3e57ee226caf444d95fd745935b639c4d, 65bc35c19536e83f70784861748e3a2f8db0004f, d25c4b987b9827a70f1a83f1a05c1920c395f5a4, dce44b8e96722ec69d740c4748587840cd6d873a, 0a8d59fb595746c5bdeaeba3a2efc51f853044d1, 784b0e63a3599de35d37546fd570e031b8a5a91e, ab51f3fdfd4946c47133c622927a82c34dce5b35, 5fcd554ffdbfcbc940d91d7ef913a8daa813f15b, 3c294806a29f576f5ecae57812d9d77ff8bdad66, 960cf3e040a8072313fdde4f6b68dab809d8b7c2, 949498029155eb6e27e20584bf7d4ee26f1299a4, cd52e3ec42bd894819f877000d2023fd966595a9, da3972017418e0ba7b27bd13f4a86cf40cee9f09, 74f17e7affa6d7334ea82c585bee6474c2da71a1, de9efc49a145d93e3f48301a6b64123e835b307a, 181c1ade62a2ac0999c1c46ec880a76416da263c, 978118b2df56713bbabcde6babf166720e1a4309, fd7b7be2930a2ab976c98f88934cfb767cfebf82, 81db1537fcf9bf9092e44b165e72178dac61b548, 906cb6c021cb9929fdebdb77b07b9777773d9f27, 06688eb7ece207c379f73ab3da479675fc56be27, f39d059b6cb320e10aea13d5883444717fb5d879, c770dd972cbca2540ba1d55cba7b797081e873ea, 374de8f3dc2f143abc1e121df98f3b9a7a22e224, 8bab78ee34689db6e9fcbb06f5f422ca69ef606f, c0f226025594c513170482fdd29d7ffc5e6ca155, cc504cf4b9755a07b13b9919ce9b76d26802c93f, 773a11986cb96543b5f44b172792c594eea03fd7, ad3edffff97ad5752345cf8f2aae4f8ae441adb2, 18927be0a4f66e929e5dc2400a941b7903158906, 8240ffd06add867661e496e966091876198d9375, fcc711f048ef6f56ad6d02e2c56a67c26d401983, 8541647be7fc6229d2a8b798da7c666710b7bcc7, 37f089a5e4777c0c01f5da2454a454ae61fc1f86, 4f8114374d8824dfdec03f576a8cd68bebce4e56, c4c641a8631f7f0c915aa4a42827cb35fc898f5e, 1916f3aacc98e64e2d5cc5cc33535a37bbfd7aff, 1a76f5deeb1f51585b4d53e1978b915e4f19eeed, f2cab567824215835e639a787671af3510f25e2b, 076d041e02d08cc78500cdea004276f57f1eca87)
  • :white_check_mark: login: cyyself / name: Yangyu Chen (84574f8a9aa28bd0493d9ed55f150fd3cb99d0aa)

Merged master, redesigned control signals and did some cleanup.

The IDecode: use DW_XPR for branch insts needs special attention as it is a regression introduced by the new decoder. See more in its commit message.

We have many things to test, but the test framework is not ready (#2991), like both RV32 and RV64, enabling both Zk/Zb, enabling only Zk and enabling only Zb. For aes64ks1i, the illegal instruction exception codepath should also be covered in CI, it seems that arch-test has not tested that.

I also added some if condition to prevent instantiating some datapath when Zb is enabled but Zbk is not.

I mark this PR as ready for review now. Some utils that could be upstreamed to Chisel, if merged, still need a longer time to hit rocket chip so I assume adding them here and removing them in future PRs is appropriate.

ZenithalHourlyRate avatar Jun 15 '22 08:06 ZenithalHourlyRate

FYI, this PR is part of my undergraduate thesis (in Chinese) and I have made it public in my blog. Any one interested could find it here and there are two presentations (in English) on that thesis. I thought that presentation (and the figures in it) could be a documentation.

ZenithalHourlyRate avatar Jun 25 '22 17:06 ZenithalHourlyRate

This has passed riscv-arch-test on rv64i_m/K and rv32i_m/K when verilated to an emulator

Finally added these tests as CI and these tests passed locally; cherry-picked many commits from #3201 so should get merged after that PR (when the OOM issue gets fixed).

ZenithalHourlyRate avatar Jan 06 '23 02:01 ZenithalHourlyRate

Other cores depend on this stuff, and I need to think about making sure migration is not too painful (and does not cause bugs).

Agree! But notice eventually we should split rocket into a single package that only publish RocketCore as public api. For other cores depending on internal modules inside core will be deprecated.

sequencer avatar Jan 14 '23 02:01 sequencer

I will clone this branch myself to check some things.

jerryz123 avatar Jan 15 '23 19:01 jerryz123

Should squash&merge for this PR because the history is too long..

ZenithalHourlyRate avatar Jan 18 '23 02:01 ZenithalHourlyRate

Yes we should squash+merge. This PR looks fine for me now.

jerryz123 avatar Jan 18 '23 02:01 jerryz123

Once CI is done I will squash+merge

jerryz123 avatar Jan 18 '23 02:01 jerryz123

Once CI is done I will squash+merge

CI passed.

ZenithalHourlyRate avatar Jan 18 '23 05:01 ZenithalHourlyRate