ibex icon indicating copy to clipboard operation
ibex copied to clipboard

[dv] Integrity error generation in memory busses

Open ctopal opened this issue 2 years ago • 14 comments

1st commit changes the old memory_error_seq so that it uses the new sequence library.

2nd commit is a bug-fix guaranteeing not injecting an error while handshaking.

3rd commit allows us to stop/start InfiniteRuns type of sequences. Necessary for stopping injecting errors while we are in an IRQ handler.

4th commit improves memory interface to have an integrity error response on demand, also changes the memory_error_seq to inject integrity errors as well as bus errors. Also introduces a new seperate test named riscv_mem_intg_err_test that randomly picks between having an integrity error or a bus error on a memory response to a request.

5th commit integrates internal NMI handling to our cosim environment by driving nmi_int in Spike and set internal NMI traps as synchronous.

6th commit updates CI to have latest version of ibex_cosim branch of Spike.

7th commit is a simple fix for tying up pc_wdata of RVFI interface

8th commit is an RTL change about allowing the register write in the case of an integrity error. Reason being, if we don't let this instruction execute, this means the interrupt caused by this instruction does need to happen pretty much instantly. And letting Spike know about that behaviour (we retired a faulty instruction that caused an interrupt) was challenging. Instead, after the load/store we now flag the next RVFI package (with the jump instruction) with int_nmi causing Spike to match with Ibex.

For more details please check commit messages.

Resolves #1731 Resolves #1732

ctopal avatar Sep 20 '22 13:09 ctopal

This does not take into account the type of memory request. That means we won't differentiate between seeing an integrity error on the memory interface while doing a store or a load (either way if rdata integrity bits are faulty, we're seeing an internal NMI -is this okay? WDYT @GregAC ). I believe this might not be ideal and we should only send an integrity error while we are doing a read. Should I change it?

ctopal avatar Sep 22 '22 16:09 ctopal

That means we won't differentiate between seeing an integrity error on the memory interface while doing a store or a load (either way if rdata integrity bits are faulty, we're seeing an internal NMI -is this okay? WDYT @GregAC ). I believe this might not be ideal and we should only send an integrity error while we are doing a read. Should I change it?

With the current Ibex RTL integrity errors on rdata for writes are ignored. However we're planning on going back to checking it (see https://github.com/lowRISC/opentitan/issues/14611) so the current behaviour is fine.

GregAC avatar Sep 23 '22 14:09 GregAC

Oh and we do need a check that an integrity error triggers an alert. This can be done as an assert for the data side. Instruction side requires a little more thought (a fetch may not actually get executed so the integrity error doesn't get consumed and doesn't trigger an alert). Though strictly that's a V2S task. So for this PR at least we don't need to worry about it, but we must not forget about it (best to create an issue to track).

GregAC avatar Sep 23 '22 14:09 GregAC

While simulating with this version I found a minor mismatch regarding two back to back integrity errors (with load instructions). RTL side does not take the faulty data from memory into the register, Spike does.

I might need help about how to tell Spike to not go into a fault but also not process the retired instruction.

ctopal avatar Oct 14 '22 01:10 ctopal

Test pass rate for this memory test is now around 60%. We hit internal NMI coverpoints though. I suspect almost all failures are related with the issue that I mentioned in the previous message.

ctopal avatar Oct 14 '22 12:10 ctopal

As discussed in the Ibex DV Sync I think we should avoid generating integrity errors in the riscv_mem_error_test and introduce a new riscv_mem_error_intg_test that does have integrity error. Just add a config option for core_ibex_mem_error_test that enables generating integrity errors and a new entry to the testlist that turns that on.

With that done this can pass CI and we can get it merged. If the new test pass rate is 60% that should do for our V2 targets and we can fix up remaining issues in V3.

GregAC avatar Oct 14 '22 15:10 GregAC

I think it might be cleaner to split out the cosim/NMI changes into a separate PR? Those changes are quite significant on their own. And could you add some more detail to the PR comment, the messages for 7222d4a012ba1b28b09cb5bd4d0f39cd5f70e9ac and 66f3e2dfc4416ba8d831f4200d525955f6fd518b are a good start but I think some more context would be really useful in the future.

hcallahan-lowrisc avatar Oct 17 '22 09:10 hcallahan-lowrisc

Thanks @hcallahan-lowrisc for having a look, I updated the commit messages for almost all of them and dropped that one unnecessary commit about catching external RVFI signals -that one should belong to another PR I agree. Latest changes should make the mem_test pass in CI (as can be seen in Ibex CI + Private CI).

Might worth just another quick look @GregAC since now it includes cosim integration too.

PS Failures on the CI says cancelled,I'm not sure why

ctopal avatar Oct 17 '22 15:10 ctopal

I think this PR increases our test time for mem_error_test (after changing it with the new sequence we probably are injecting way more memory errors) and tips it over 30 mins. What should we do about it?

ctopal avatar Oct 18 '22 09:10 ctopal

I think this PR increases our test time for mem_error_test (after changing it with the new sequence we probably are injecting way more memory errors) and tips it over 30 mins. What should we do about it?

Let's try dropping the frequency of memory errors and see what that does to the run time.

GregAC avatar Oct 19 '22 08:10 GregAC

@GregAC can you have a look at https://github.com/lowRISC/ibex/pull/1811/commits/66d8d0fc0b087fa48f74cf8e0d343b582e2ad8e9 to see if this makes sense?

ctopal avatar Oct 20 '22 17:10 ctopal

@GregAC can you have a look at 66d8d0f to see if this makes sense?

It does make sense, though having reviewed our discussion I think we want to leave the RTL behaviour as is. This means we need to instead alter cosim to not do the register writes on ECC failures. I will take a look at this.

GregAC avatar Oct 21 '22 13:10 GregAC

@GregAC can you have a look at 66d8d0f to see if this makes sense?

It does make sense, though having reviewed our discussion I think we want to leave the RTL behaviour as is. This means we need to instead alter cosim to not do the register writes on ECC failures. I will take a look at this.

I've worked out how I'm going to do this, just need to implement it. I'm going to use a method that doesn't require spike changes (as otherwise these would be quite invasive to suppress a register write on a load).

GregAC avatar Oct 24 '22 19:10 GregAC

I've added some stuff and reworked commits here to give https://github.com/lowRISC/ibex/pull/1811 which supersedes this PR and should get the memory integrity test running and passing.

GregAC avatar Nov 04 '22 21:11 GregAC