[BUG] Returns to address zero can be ignored

Open PRugg-Cap opened this issue 6 months ago • 3 comments

Is there an existing CVA6 bug for this?

  • [x] I have searched the existing bug issues

Bug Description

Returns don't trigger bp_valid in the frontend when the RAS is empty, so fetch doesn't get redirected:

https://github.com/openhwgroup/cva6/blob/301d11ceb88c1169f75e9dea415e4bff4eb29888/core/frontend/frontend.sv#L306

However, returns mark the cf_type as Return, even if the RAS entry isn't valid:

https://github.com/openhwgroup/cva6/blob/301d11ceb88c1169f75e9dea415e4bff4eb29888/core/frontend/frontend.sv#L260-L264

The RAS returns predict_address zero when it's empty.

The mispredict check in the branch unit checks if the predicted address differs from the target address:

https://github.com/openhwgroup/cva6/blob/301d11ceb88c1169f75e9dea415e4bff4eb29888/core/branch_unit.sv#L101-L103

The problem is that if the architecturally correct target address actually is zero, the check sees no mispredict, fetch is never redirected, and the instruction just retires as basically a nop.
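
To make the chain of events above concrete, here is a minimal Python sketch of the interaction. It is a toy behavioural model, not CVA6 RTL: the names `Prediction`, `predict_return` and `is_mispredict` are hypothetical, and the mispredict check is reduced to the address comparison described above.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    cf_type: str          # control-flow type reported by the frontend
    predict_address: int  # address the frontend claims to have predicted
    redirected: bool      # whether fetch was actually redirected (bp_valid)


def predict_return(ras_stack):
    """Toy frontend: predict a return from a possibly empty RAS."""
    if ras_stack:
        return Prediction("Return", ras_stack[-1], True)
    # Empty RAS: cf_type is still Return, the address defaults to zero,
    # and fetch is not redirected.
    return Prediction("Return", 0, False)


def is_mispredict(pred, target_address):
    """Toy branch unit: flag a mispredict when the addresses differ."""
    return target_address != pred.predict_address


pred = predict_return([])
print(is_mispredict(pred, 0x80000000))  # nonzero target: caught
print(is_mispredict(pred, 0))           # zero target: missed
```

With an empty RAS, any nonzero target is caught by the comparison, but a target of zero matches the default prediction even though fetch was never redirected.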

Found by fuzzing against the Sail model using TestRIG. I didn't reproduce it on a more up-to-date CVA6 using a more traditional test, but looking through the code, the same problem seems to apply. It's obviously a fairly esoteric bug, but I think jumping to address zero should be legal. If people think it's important, I can try to find time to create a reproducer.

Three ideas for fixes:

  • Add || branch_predict_i.predict_address == '0 to the mispredict check. I've confirmed this fixes the bug, but it introduces some extra gates. It would also be very bad for performance if you were actually jumping to address zero repeatedly for some reason.
  • Set the LSB of the RAS predict address if it's invalid: since target addresses will never have LSBs set, this guarantees the mispredict equality check will fail. I've confirmed this fixes the bug. We could actually put the valid bit in the RAS in the LSB of the returned address (and swap the sense) to save a bit of storage per entry.
  • Only set cf_type == Return if the prediction is valid, so that the instruction gets detected as a mispredict. I haven't tested this one.
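
The three ideas can be compared side by side in a small Python sketch. This is a behavioural toy, not CVA6 code: all function names are hypothetical, and it assumes (as the third idea implies) that a jump with no predicted control flow is always treated as a mispredict.

```python
def mispredict_baseline(cf_type, predict_address, target):
    """Current behaviour as described in this issue (simplified)."""
    if cf_type != "Return":
        # Assumption: no predicted control flow always counts as a mispredict.
        return True
    return target != predict_address


def mispredict_fix1(cf_type, predict_address, target):
    """Idea 1: also flag a mispredict whenever the predicted address is zero."""
    return mispredict_baseline(cf_type, predict_address, target) or predict_address == 0


def ras_predict_fix2(ras_stack):
    """Idea 2: set the LSB of the predicted address when the RAS is empty."""
    if ras_stack:
        return ras_stack[-1]
    return 0 | 1  # real targets never have the LSB set, so this never matches


def cf_type_fix3(ras_stack):
    """Idea 3: only report Return when the RAS entry is valid."""
    return "Return" if ras_stack else "NoCF"


# Return to address zero with an empty RAS:
print(mispredict_baseline("Return", 0, 0))                     # bug: not flagged
print(mispredict_fix1("Return", 0, 0))                         # flagged
print(mispredict_baseline("Return", ras_predict_fix2([]), 0))  # flagged
print(mispredict_baseline(cf_type_fix3([]), 0, 0))             # flagged
```

Note that in this model the first idea also flags a mispredict when the RAS holds a valid entry pointing at zero, which is the performance cost mentioned in the first bullet; the second idea does not have that cost.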

PRugg-Cap, Jun 23 '25 14:06

Thanks for this interesting feedback!

@AnouarZajni do you think it is related to the RAS limitation you have highlighted some time ago?

JeanRochCoulon, Jul 03 '25 21:07

Hi, I encountered the same bug using formal tools. The main problem is that it causes an incoherence between the frontend's PC and the scoreboard's PC. It appears to act like a NOP because the frontend's PC does not "jump", so the core keeps executing instructions at the next addresses instead of jumping. However, the backend's PC is modified, so the two PCs are no longer in a coherent state.

Example scenario

At the top, npc_q is the PC on the frontend side, fetch_paddr is the address of the fetched instruction, and fetch_entry_o[0].address is the PC on the backend side.

Here, there are 2 jumps, but only the first one triggers the bug. The first jump is a return, jalr x9, x1, 72, that is fetched from address 0x7190 without any address prediction (predicting 0 by default).

We can see in the frontend's output that the PC of the second jump is wrongly updated: its PC in the scoreboard is 0 instead of 0x7194. This seems to happen when the cf_type is Return. Usually, though, a flush is triggered after the jump and these instructions are not committed.

However, in this case the real jump address is 0, so the branch unit observes that the prediction is actually correct and does not trigger a flush, and the core keeps executing instructions it shouldn't.
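
The scenario can be replayed with a few lines of Python. This is only a sketch under stated assumptions: the value of x1 (0xFFFFFFB8) is hypothetical and chosen so that the jalr target wraps to zero on a 32-bit core, the helper names are made up, and the model assumes the backend-side PC follows the predicted address whenever cf_type is Return, which is one reading of the waveform description.

```python
XLEN_MASK = 0xFFFFFFFF  # cv32a60x is a 32-bit configuration


def jalr_target(rs1_value, imm):
    """RISC-V JALR target: rs1 + imm, wrapped to XLEN, with bit 0 cleared."""
    return (rs1_value + imm) & XLEN_MASK & ~1


def replay_return(fetch_pc, predicted_address, target):
    """Return (frontend_next_pc, backend_next_pc, flush) for an unredirected return."""
    frontend_next = fetch_pc + 4          # fetch was not redirected (no bp_valid)
    backend_next = predicted_address      # assumption: backend PC follows the prediction
    flush = target != predicted_address   # simplified branch-unit mispredict check
    return frontend_next, backend_next, flush


target = jalr_target(0xFFFFFFB8, 72)      # hypothetical x1 value
frontend_pc, backend_pc, flush = replay_return(0x7190, 0, target)
print(hex(target), hex(frontend_pc), hex(backend_pc), flush)
```

In this model the frontend continues at 0x7194 while the backend-side PC becomes 0, and no flush is raised to reconcile them, which matches the behaviour reported above.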

The optimal fix would be to remove the separation between the two PCs, because future changes could break this coherence. For example, even if the second jump's target address was wrongly calculated, the core would still jump to that address, because the instruction is a JAL and no mispredictions are supposed to happen on JAL.

Image

branch: cv32a60x | targeted configuration: cv32a60x | targeted design: cva6_pipeline.sv


Product: Questa OneSpin Solutions | App: Questa Processor App | Tool's version: 2025.1

TeoBernier, Jul 10 '25 16:07

Agreed: this is exactly the bug I was seeing. Thanks for presenting a concrete example!

PRugg-Cap, Jul 10 '25 18:07