QNICE-FPGA icon indicating copy to clipboard operation
QNICE-FPGA copied to clipboard

Daisy chain module edge case

Open sy2002 opened this issue 3 years ago • 9 comments

I stumbled over the following situation in the context of the refactoring of the timer module to use our new daisy chain module. Consider this situation:

  1. The timer fires and therefore requests an interrupt
  2. The CPU cannot handle it, yet, because it is currently in an ISR of that very timer. That ISR has the following task: deleting the ISR address of the timer (with the intention to stop the timer).
  3. The Daisy chain module is latching the request
  4. In the meantime, the ISR sucessfully sets the timer's ISR address register to zero (with the intention to stop the timer). That means: The timer now does not have any ISR address to provide any more. (Because the register was deleted.)
  5. The CPU now has time again to handle interrupts
  6. The Daisy chain module remembers, that there is a latched interrupt request from the timer module
  7. The handshake between the Daisy chain module and the CPU takes place
  8. The Daisy chain module pulls this_grant_n_o to zero
  9. 💥 💥 💥 The timer module puts the ISR address 0x0000 on the data bus 💣 💣 💣

sy2002 avatar Sep 09 '20 13:09 sy2002

This whole thing is admitably a very edgy edge case 😈

I am not even sure if it can happen in real hardware, but it can be easily reproduced using the Simulator. Here is a commit: e9d4f3deeafca2e4766806e924437044795003bf . If you do a git checkout e9d4f3d and then use Vivado's built in simulator on the simulation called dev_int, you will be able to see the situation at around 4,500,436 ns:

grafik

In vhdl/sim you'll find all the files needed for the simulation. Since I repaired it in Vivado, the simulation should start with "one click", but before doing so, you need to assemble dev_int.asm:

TITANIUM-M5:tools sy2002$ cd ../vhdl/sim/
TITANIUM-M5:sim sy2002$ ll
total 176
-rw-r--r--  1 sy2002  staff   4,7K  7 Sep 13:09 dev_int.asm
-rw-r--r--  1 sy2002  staff    51K  9 Sep 11:42 dev_int.lis
-rw-r--r--  1 sy2002  staff   2,1K  9 Sep 11:42 dev_int.out
-rw-r--r--  1 sy2002  staff   2,6K  9 Sep 11:42 dev_int.rom
-rw-r--r--  1 sy2002  staff   9,8K  9 Sep 12:56 dev_int.vhd
-rw-r--r--  1 sy2002  staff   2,1K  9 Sep 11:49 dev_int_globals.vhd
-rw-r--r--  1 sy2002  staff   2,6K  7 Sep 13:09 dev_int_source.vhd

The overarching question is: Do we want to fix this at all?

This situation is not very likely to occur in hardware.

If we would like to fix it, then one idea might be to add a "I changed my mind" input to the Daisy chain module, which works like this: If the signal is asserted AND we are still waiting until its our turn (i.e. the CPU was not even able to look at our request) THEN the interrupt request latch is deleted.

Happy to discuss. Here or via email or via video call.

sy2002 avatar Sep 09 '20 14:09 sy2002

Definitely something we want to fix. I'll need to think about it.

MJoergen avatar Sep 09 '20 14:09 MJoergen

Just in case our messages overlapped, I added this comment: https://github.com/sy2002/QNICE-FPGA/issues/123#issuecomment-689586438

sy2002 avatar Sep 09 '20 14:09 sy2002

Wouldn't the timer deassert the /INT line as soon as one of its registers became 0? Thus it would no longer be part of the interrupt daisy chain setup?!

bernd-ulmann avatar Sep 09 '20 14:09 bernd-ulmann

@bernd-ulmann Hi Bernd - I guess there is a misunderstanding on your side. Let me explain:

As the correct and edge case proven Interrupt and Daisy chain protocol is not-trivial, Michael and I came up with the idea to have a reusable Daisy chain and interrupt protocol handler device that we can just plug-in to all our interrupt enabled hardware. Currently this is "just" the VGA and the Timer, but in future there will be more, so we wanted to make sure, that not every new device needs to be "battle hardened" over and over again.

The timer interrupt as it is released in V1.6, is using it's own Daisy chain logic. And there, everything works fine.

Now, here in the develop branch of V1.7, we started to bring together this new generic and reusable Daisy chain handling module in vhdl/daisy_chain.vhd with the devices. VGA is already using it and today I refactored the timer interrupt to use it, too.

While I did this refactoring, I discovered, that the module, which is absolutely great in itself and handles a lot of complexity, has this edge case issue described here.

If you want to learn more, here is the documentation about this new module. You can compare this module to a library in C that contains the functionality, so that not each and every device we make needs to have code for this logic again and again.

sy2002 avatar Sep 09 '20 14:09 sy2002

After some thought, I've come up with three possible solutions. These are (in no particular order):

A: Remove the latching mechanism from the daisy_chain module. This brings the design closer to V1.6 and is in line with Bernd's suggestion. In this case the device would need to remember that it wants to request an interrupt, and should detect when the request has been granted, so that the device can de-assert the interrupt request line. This could potentially generate a combinatorial loop in the hardware, but that is still to be investigated.

B: Have the CPU treat the ISR address = 0x0000 as a special case. In other words, if the CPU receives an ISR address of 0x0000, the CPU should ignore the interrupt request and therefore not set the Program Counter. This requires changing the CPU, and giving the CPU knowledge of a "special" value. However, the existing devices (timer and VGA) currently treat the ISR address 0x0000 as a special case. On the other hand, I can think of a situation where the application wants to implement a "watch-dog" functionality and have a timer interrupt specifically point to address 0x0000 (which is the Cold Reset). I.e. a user might indeed want the system to Cold Reset if a given timer interrupt occurs. This is not possible with the current implementation of the timer module.

C: Other computer architectures have a separate interrupt enable register, so we could do that too. That way, the device will no longer treat the ISR address 0x0000 as a special value. When an application wants to enable interrupts, the program must first set the ISR address and then set the enable bit. When the application wants to disable interrupts, the program must clear the enable bit (but leave the ISR address untouched). The race condition mentioned in this issue can still occur, in that the instruction disabling interrupts might be in the process of being executed when the interrupt request fires. That would mean the ISR is entered AFTER the programmer has disabled interrupts.

Looking forward to our call on Tuesday to discuss this issue!

MJoergen avatar Sep 10 '20 13:09 MJoergen

What call? I missed something, it seems... :-)

Regarding your suggestions: All three are great. The easiest one to implement might be to give each device capable of issuing interrupts a command and status register (CSR) with a dedicated interrupt enable bit so the device would not have to test for an ISR of zero. ...but this would not clear an already pending interrupt which would be serviced nevertheless but at least with the correct ISR, right?

Didn't we originally say that the timer module is disabled if any of the three timer registers contains zero? In this case we could avoid the problem by just setting the PRE or CNT register to zero instead of chainging INT or am I missing something here?

bernd-ulmann avatar Sep 10 '20 14:09 bernd-ulmann

@MJoergen re (A) I think the latching is super-awesome. It prevents large challenges for the implementors and takes away the burden of the "wait until it is your turn" logic. So we should keep it like that.

Let's discuss on Tuesday.

sy2002 avatar Sep 10 '20 23:09 sy2002

Should be solved by issue #136. As soon as #136 is done, the simulation test program needs to be rewritten and then the edge case needs to be retested.

sy2002 avatar Sep 15 '20 21:09 sy2002