CVA6 hangs on store sequence
I have a CVA6 hang that was discovered while running random test cases in an emulation environment. I am in email contact with Florian, and will send him simulation waveforms of the hang. Here is a brief description of the failing sequence as seen in Spike and the CVA6 instruction trace. The CVA6 stops printing instruction data on a store byte. In the sequence shown below, the Spike model issues a couple of store bytes, a missed branch, a taken branch to a load, and a page fault taken on that load. The CVA6 trace stops following the first store byte. I'm not sure whether this is the last instruction retired from the CVA6 or whether the trace dump stopped earlier than the actual hang. The AXI bus interface shows no unusual activity: there are no errors, and all requests have been responded to. The code and trace information is as follows:
Code sequence:
0000000080000090 <interrupt_handlers>:
80000090: 34102a73 csrr s4,mepc
00000000800009ba <loop_33>:
800009ba: fca6c003 lbu zero,-54(a3)
800009be: 005686a3 sb t0,13(a3)
800009c2: 0fe0e593 ori a1,ra,254
800009c6: 117d addi sp,sp,-1
800009c8: 02b206bb mulw a3,tp,a1
800009cc: 28257493 andi s1,a0,642
800009d0: 418d li gp,3
800009d2: 02e48ca3 sb a4,57(s1)
800009d6: e2368493 addi s1,a3,-477
800009da: 47060213 addi tp,a2,1136
00000000800009de <loop_34>:
800009de: fcc48b23 sb a2,-42(s1)
800009e2: 02d606b3 mul a3,a2,a3
800009e6: 02558ba3 sb t0,55(a1)
800009ea: 11fd addi gp,gp,-1
800009ec: fdc5c303 lbu t1,-36(a1)
800009f0: fe948ea3 sb s1,-3(s1)
800009f4: 000489a3 sb zero,19(s1)
800009f8: fe0193e3 bnez gp,800009de <loop_34>
800009fc: fa011fe3 bnez sp,800009ba <loop_33>
Spike trace:
core 0: 1 0x00000000800009e2 (0x02d606b3) x13 0x33dbb07401000000
core 0: 1 0x00000000800009e6 (0x02558ba3) mem 0x0000000080060077 0xed
core 0: 1 0x00000000800009ea (0x11fd) x 3 0x0000000000000000
core 0: 1 0x00000000800009ec (0xfdc5c303) x 6 0x000000000000003d mem 0x000000008006001c
core 0: 1 0x00000000800009f0 (0xfe948ea3) mem 0x000000008007fe60 0x63
core 0: 1 0x00000000800009f4 (0x000489a3) mem 0x000000008007fe76 0x0
core 0: 1 0x00000000800009f8 (0xfe0193e3)
core 0: 1 0x00000000800009fc (0xfa011fe3)
core 0: 3 0x0000000080000090 (0x34102a73) x20 0x00000000800009ba
CVA6 trace:
192785ns 19178 S 00000000800009e2 0 02d606b3 mul a3, a2, a3 a3 :33dbb07401000000 a2 :0000000080070040 a3 :8708404160040000
192805ns 19180 S 00000000800009e6 0 02558ba3 sb t0, 55(a1) t0 :000000008007fbed a1 :0000000080060040 VA: 0000000080060077 PA: 00000080060077
192805ns 19180 S 00000000800009ea 0 000011fd c.addi gp, gp, -1 gp :0000000000000000 gp :0000000000000001
192835ns 19183 S 00000000800009ec 0 fdc5c303 lbu t1, -36(a1) t1 :000000000000003d a1 :0000000080060040 VA: 000000008006001c PA: 0000008006001c
192955ns 19195 S 00000000800009f0 0 fe948ea3 sb s1, -3(s1) s1 :000000008007fe63 s1 :000000008007fe63 VA: 000000008007fe60 PA: 00000080060e60
First of all, sorry for the long delay. We had issues with the licenses, so that needed to be resolved first.
Indeed, I can confirm that there seems to be an issue with the LSU; interesting to see such a thing appearing now, I honestly wouldn't have expected it. It seems the load unit violates the protocol to the cache unit when it is in the WAIT_GNT state and an exception is incoming (it doesn't properly kill the transaction to the cache), which then blocks the store unit from making forward progress, hence the lock-up.
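To make the suspected lock-up mechanism concrete, here is a minimal Python sketch (not CVA6 RTL; the signal names and the single-shared-port abstraction are assumptions based on the description above). It models a cache port shared by the load and store units: the load is waiting for a grant when the exception arrives, and if the load unit fails to kill its outstanding request, the port stays reserved and the store unit never makes progress.

```python
def simulate(kill_on_exception, max_cycles=10):
    """Toy model of the shared load/store cache port.

    The load unit has issued a request and sits in WAIT_GNT when an
    exception hits. If it kills the transaction, the port frees and the
    store unit proceeds (returns the cycle it gets the port); if not,
    the request is left pending forever and we return None (lock-up).
    """
    port_busy_with_load = True  # load request issued, no grant yet
    for cycle in range(max_cycles):
        if cycle == 0:  # exception arrives while in WAIT_GNT
            if kill_on_exception:
                # Correct behavior: kill the outstanding transaction,
                # releasing the cache port.
                port_busy_with_load = False
            # Buggy behavior: the load unit goes idle but leaves the
            # request pending, so the port is never released.
        if not port_busy_with_load:
            return cycle  # store unit finally gets the port
    return None  # store unit starved: matches the observed hang


print(simulate(kill_on_exception=True))   # fixed path: store proceeds (0)
print(simulate(kill_on_exception=False))  # buggy path: None (lock-up)
```

This is only an illustration of the failure mode, not of the actual dcache handshake, which involves req/gnt/kill signaling per port.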
I am not 100% sure about the solution, but what I would essentially add is a case to properly terminate the transaction at: https://github.com/openhwgroup/cva6/blob/e4b48a794b0229eeb9264a6b3f459c86545d1d85/core/load_unit.sv#L247
if (state_d != ABORT_TRANSACTION_NI && state_d != ABORT_TRANSACTION) state_d = IDLE;
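Note the intent of the guard: fall back to the idle state only when the FSM is not already committed to one of the abort states. That requires a conjunction (`&&`) of the two inequalities; the disjunction `state_d != A || state_d != B` is true for every state (no state equals both abort states at once), so it would unconditionally overwrite a pending abort. A minimal Python restatement of the intended guard (state names copied from the snippet above; this is a sketch, not the RTL):

```python
ABORT_TRANSACTION = "ABORT_TRANSACTION"
ABORT_TRANSACTION_NI = "ABORT_TRANSACTION_NI"
IDLE = "IDLE"


def resolve_next_state(state_d):
    """Apply the exception fall-through without clobbering an abort.

    Only when the next state is NOT one of the two abort states do we
    override it with IDLE; an in-flight abort must be allowed to
    complete so the cache transaction is actually killed.
    """
    if state_d != ABORT_TRANSACTION_NI and state_d != ABORT_TRANSACTION:
        return IDLE
    return state_d


print(resolve_next_state("WAIT_GNT"))          # overridden to IDLE
print(resolve_next_state(ABORT_TRANSACTION))   # abort preserved
```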
Hi @zarubaf and @JeanRochCoulon, it seems this is a valid RTL bug with a known solution. However, the RTL has not been updated on either the master or cv32a6_v5.0.0 branches.
Hello @mpdickman, I know a long time has passed since you discovered this issue. But if we release a fixed version of CVA6, would you be able to check the fix by running your use case?
Closing because of staleness.