CVA6 hangs on store sequence
I have a CVA6 hang that was discovered while running random test cases in an emulation environment. I am in email contact with Florian, and will send him simulation waveforms of the hang. Here is a brief description of the failing sequence as seen in Spike and the CVA6 instruction trace. The CVA6 stops printing instruction data on a store byte. In the sequence shown below, the Spike model issues a couple of store bytes, a missed branch, a taken branch to a load, and a page fault taken on that load. The CVA6 trace stops following the first store byte. I'm not sure whether this is the last instruction retired from the CVA6 or whether the trace dump stopped earlier than the actual hang. The AXI bus interface shows no unusual activity: there are no errors, and all requests have been responded to. The code and trace information is as follows:
Code sequence:
0000000080000090 <interrupt_handlers>:
80000090: 34102a73 csrr s4,mepc
00000000800009ba <loop_33>:
800009ba: fca6c003 lbu zero,-54(a3)
800009be: 005686a3 sb t0,13(a3)
800009c2: 0fe0e593 ori a1,ra,254
800009c6: 117d addi sp,sp,-1
800009c8: 02b206bb mulw a3,tp,a1
800009cc: 28257493 andi s1,a0,642
800009d0: 418d li gp,3
800009d2: 02e48ca3 sb a4,57(s1)
800009d6: e2368493 addi s1,a3,-477
800009da: 47060213 addi tp,a2,1136
00000000800009de <loop_34>:
800009de: fcc48b23 sb a2,-42(s1)
800009e2: 02d606b3 mul a3,a2,a3
800009e6: 02558ba3 sb t0,55(a1)
800009ea: 11fd addi gp,gp,-1
800009ec: fdc5c303 lbu t1,-36(a1)
800009f0: fe948ea3 sb s1,-3(s1)
800009f4: 000489a3 sb zero,19(s1)
800009f8: fe0193e3 bnez gp,800009de <loop_34>
800009fc: fa011fe3 bnez sp,800009ba <loop_33>
Spike trace:
core 0: 1 0x00000000800009e2 (0x02d606b3) x13 0x33dbb07401000000
core 0: 1 0x00000000800009e6 (0x02558ba3) mem 0x0000000080060077 0xed
core 0: 1 0x00000000800009ea (0x11fd) x 3 0x0000000000000000
core 0: 1 0x00000000800009ec (0xfdc5c303) x 6 0x000000000000003d mem 0x000000008006001c
core 0: 1 0x00000000800009f0 (0xfe948ea3) mem 0x000000008007fe60 0x63
core 0: 1 0x00000000800009f4 (0x000489a3) mem 0x000000008007fe76 0x0
core 0: 1 0x00000000800009f8 (0xfe0193e3)
core 0: 1 0x00000000800009fc (0xfa011fe3)
core 0: 3 0x0000000080000090 (0x34102a73) x20 0x00000000800009ba
CVA6 trace:
192785ns 19178 S 00000000800009e2 0 02d606b3 mul a3, a2, a3 a3 :33dbb07401000000 a2 :0000000080070040 a3 :8708404160040000
192805ns 19180 S 00000000800009e6 0 02558ba3 sb t0, 55(a1) t0 :000000008007fbed a1 :0000000080060040 VA: 0000000080060077 PA: 00000080060077
192805ns 19180 S 00000000800009ea 0 000011fd c.addi gp, gp, -1 gp :0000000000000000 gp :0000000000000001
192835ns 19183 S 00000000800009ec 0 fdc5c303 lbu t1, -36(a1) t1 :000000000000003d a1 :0000000080060040 VA: 000000008006001c PA: 0000008006001c
192955ns 19195 S 00000000800009f0 0 fe948ea3 sb s1, -3(s1) s1 :000000008007fe63 s1 :000000008007fe63 VA: 000000008007fe60 PA: 00000080060e60
First of all, sorry for the long delay. We had issues with the licenses, so that needed to be resolved first.
Indeed, I can confirm that there seems to be an issue with the LSU; interesting to see such a thing appearing now, I honestly wouldn't have expected it. It seems the load unit violates the protocol to the cache unit when it is in the WAIT_GNT state and an exception is incoming (it doesn't properly kill the transaction to the cache), which then blocks the store unit from making forward progress, hence the lock-up.
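To make the suspected lock-up mechanism concrete, here is a minimal Python sketch (not CVA6 RTL; the signal names and the single-shared-port abstraction are assumptions based on the description above). It models a cache port shared by the load and store units: the load is waiting for a grant when the exception arrives, and if the load unit fails to kill its outstanding request, the port stays reserved and the store unit never makes progress.

```python
def simulate(kill_on_exception, max_cycles=10):
    """Toy model of the shared load/store cache port.

    The load unit has issued a request and sits in WAIT_GNT when an
    exception hits. If it kills the transaction, the port frees and the
    store unit proceeds (returns the cycle it gets the port); if not,
    the request is left pending forever and we return None (lock-up).
    """
    port_busy_with_load = True  # load request issued, no grant yet
    for cycle in range(max_cycles):
        if cycle == 0:  # exception arrives while in WAIT_GNT
            if kill_on_exception:
                # Correct behavior: kill the outstanding transaction,
                # releasing the cache port.
                port_busy_with_load = False
            # Buggy behavior: the load unit goes idle but leaves the
            # request pending, so the port is never released.
        if not port_busy_with_load:
            return cycle  # store unit finally gets the port
    return None  # store unit starved: matches the observed hang


print(simulate(kill_on_exception=True))   # fixed path: store proceeds (0)
print(simulate(kill_on_exception=False))  # buggy path: None (lock-up)
```

This is only an illustration of the failure mode, not of the actual dcache handshake, which involves req/gnt/kill signaling per port.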
I am not 100% sure about the solution, but what I would essentially add is a case to properly terminate the transaction at: https://github.com/openhwgroup/cva6/blob/e4b48a794b0229eeb9264a6b3f459c86545d1d85/core/load_unit.sv#L247
if (state_d != ABORT_TRANSACTION_NI && state_d != ABORT_TRANSACTION) state_d = IDLE;
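Note the intent of the guard: fall back to the idle state only when the FSM is not already committed to one of the abort states. That requires a conjunction (`&&`) of the two inequalities; the disjunction `state_d != A || state_d != B` is true for every state (no state equals both abort states at once), so it would unconditionally overwrite a pending abort. A minimal Python restatement of the intended guard (state names copied from the snippet above; this is a sketch, not the RTL):

```python
ABORT_TRANSACTION = "ABORT_TRANSACTION"
ABORT_TRANSACTION_NI = "ABORT_TRANSACTION_NI"
IDLE = "IDLE"


def resolve_next_state(state_d):
    """Apply the exception fall-through without clobbering an abort.

    Only when the next state is NOT one of the two abort states do we
    override it with IDLE; an in-flight abort must be allowed to
    complete so the cache transaction is actually killed.
    """
    if state_d != ABORT_TRANSACTION_NI and state_d != ABORT_TRANSACTION:
        return IDLE
    return state_d


print(resolve_next_state("WAIT_GNT"))          # overridden to IDLE
print(resolve_next_state(ABORT_TRANSACTION))   # abort preserved
```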
Hi @zarubaf and @JeanRochCoulon, it seems this is a valid RTL bug with a known solution. However, the RTL has not been updated on either the master or cv32a6_v5.0.0 branches.
Hello @mpdickman, I know a long time has passed since you discovered this issue. But if we release a fixed version of CVA6, would you be able to check the fix by running your use case?
Closing because of staleness.