wb2axip

Question regarding AW / W channel dependencies

Open tristanitschner opened this issue 3 years ago • 7 comments

Note: This is not an issue, but merely a notice to Mr. ZipCPU. Since I am certain from reading his blog that he is a very busy man indeed, I thought this is the right place to put this.

I came here after reading the blog post "The hard part of building a bursting AXI Master," which was very helpful for me indeed, since I'm about to write an AXI DMA myself. (Once I have the time.) Here the problem of generating an AW transaction was solved by the use of "phantom signals," or to put it in other words, 'doing the AW calculation for the next burst transfer in the three cycles at the beginning of the current burst transfer.' However, while using Xilinx's core I previously came across the fact that W transactions may take place before the associated AW transaction, and this is also explicitly stated in the AXI specification:

"This means, for example, that the write data can appear at an interface before the write address for the transaction. This can occur if the write address channel contains more register stages than the write data channel. Similarly, the write data might appear in the same cycle as the address." (A3.3, AMBA AXI and ACE Protocol Specification)

That is, this problem may also be solved by simply delaying the AW transaction by the required amount of cycles, while issuing the W requests right away. (Xilinx's deprecated AXI Master Burst does this and issues the AW transaction 3 cycles later.)

I have observed in your design that in burst transactions the AW transaction is always issued synchronously to the first associated W transaction. (Such as in Fig. 10 of the aforementioned blog post.) Is there any practical reason you obey this additional rule? In the past I've heard about IP cores that do not work properly because they aren't designed for this case. Maybe this is the reason?

tristanitschner, May 15 '22 19:05

Is there any practical reason you obey this additional rule?

Yes, there is. I haven't figured out how to formally verify an interface where the write data arrives prior to the write address data, or (worse) where the write address data is more than one burst ahead of the write data. It's a limitation in my AXI formal verification suite, and one that I'd like to correct, but that limitation ends up imposing additional restrictions on the IP I write.

But let me back up to your previous comment, where you suggest that the problem of determining burst boundaries "may also be solved by simply delaying the AW transaction by the required number of cycles, while issuing the W requests right away." This won't actually fix the problem, since the information in the W requests is tightly linked to the AW requests. The most challenging example of this is WLAST. WLAST must be set based upon AWLEN. If you determine the first AWLEN must be zero, but do so only 4 clocks into the burst, then you've missed your first required WLAST. As a result, this would be a broken design rather than a better design.
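To make the WLAST dependency concrete, here is a minimal Python sketch (not taken from the wb2axip sources; the function name `wlast_flags` is made up for illustration). It relies only on the AXI rule that a burst carries AWLEN + 1 beats, so WLAST belongs on beat number AWLEN.

```python
def wlast_flags(awlen):
    """Return the WLAST value for every beat of one write burst.

    Hypothetical helper for illustration. AXI encodes a burst of
    AWLEN + 1 beats, so WLAST is high only on beat index AWLEN.
    """
    if not 0 <= awlen <= 255:
        raise ValueError("AWLEN is an 8-bit field in AXI4")
    return [beat == awlen for beat in range(awlen + 1)]


# With AWLEN = 0 the very first W beat must already carry WLAST,
# so the burst length has to be known before that beat is issued.
print(wlast_flags(0))
print(wlast_flags(3))
```

If AWLEN were only decided several clocks after the first W beat had left, the single-beat case above could no longer be signalled correctly.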

ZipCPU, May 17 '22 03:05

Thinking about this some more ...

  1. As mentioned above, W data before AW makes no sense, unless you know how to set WLAST. WLAST depends on AWLEN, and that depends on the 4kB address limit so that depends on AWADDR. If you know AWADDR and AWLEN, then, why not issue AW at the same time?
  2. W data before AW offers no benefit: the interconnect must hold all W data arriving before AW in a buffer, until it knows which slave to route the W data to. This, however, depends on AWADDR.
  3. If your data isn't aligned, then it will be important to shift it properly. The appropriate shift depends on the low order address bits of AWADDR--again, preventing W from being issued prior to AW.
  4. The biggest benefit to be had would be issuing a second or subsequent AW request prior to the end of the first WLAST. This isn't W before AW, but rather AW before W. In this way, you can keep the routing resources loaded in the interconnect. However ...
  5. Neither AW nor W requests should be issued prior to the write data being available. The general rule of bus interactions is to do your business on the bus and get off as fast as possible. (It's also a common bathroom rule in a large house: do your business or get off the pot.) This is to keep bus resources from being consumed unnecessarily. In most common cases the bus cannot transfer write data any faster than it arrives in the DMA. For these common cases, there will only be a limited ability/benefit in issuing AW early.
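The chain in point 1 (WLAST depends on AWLEN, which depends on the 4kB limit, which depends on AWADDR) can be sketched in a few lines of Python. This is an illustrative model only, not the logic used in any particular core; `next_burst`, its parameters, and the defaults (4 bytes per beat, AXI4 INCR bursts of up to 256 beats, beat-aligned addresses) are assumptions made for the example.

```python
def next_burst(addr, beats_left, bytes_per_beat=4, max_beats=256):
    """Pick the length of the next INCR burst starting at addr.

    Hypothetical helper. Returns (awlen, beats). The burst is clipped
    so that it never crosses a 4kB address boundary.
    """
    if beats_left <= 0:
        raise ValueError("nothing left to transfer")
    if addr % bytes_per_beat != 0:
        raise ValueError("this sketch assumes beat-aligned addresses")
    # Beats that still fit before the next 4kB boundary.
    beats_to_boundary = (4096 - addr % 4096) // bytes_per_beat
    beats = min(beats_left, max_beats, beats_to_boundary)
    return beats - 1, beats


# One beat below a 4kB boundary: the first burst must have AWLEN = 0.
print(next_burst(0x0FFC, 16))
# Starting on the boundary, the same request fits in a single burst.
print(next_burst(0x1000, 16))
```

The first call shows the case mentioned earlier in the thread: the start address alone forces a single-beat first burst, so neither AWLEN nor WLAST can be chosen without AWADDR.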

So ... there's more to the issue.

ZipCPU, May 31 '22 12:05

Hello Mr. ZipCPU,

Whether the AXI standard makes sense or not is indeed a totally different question. I never thought about the WLAST signal, which is of course redundant if you know the associated AW transaction and keep a counter, so I can only substantiate your point in this regard.

Regarding the other points, I have nothing to object except maybe the ASIC circuitry inside the DDR controller inside a Zynq MPSoC. Since DDR is organised in banks / pages, it benefits from knowing the address before the data and can fetch the associated page beforehand. More complex scheduling schemes akin to an interconnect are of course also possible. However, this problem may also be solved by using large enough FIFOs, which I find to be the simpler solution. On the other hand, the ASIC runs at a much higher frequency than the programmable logic, which counteracts this advantage.

This brings me to my last point concerning your first response: Verifying an IP where AW channel and W channel are not synchronous. This problem can also be solved by using FIFOs, which resynchronize the signals at the output. However, this complicates the testbench and I suspect that formal tools won't like it from a performance standpoint. It is also identical to using a software FIFO, for which one can use the C++ standard libraries when using Verilator, or using the allocate / pointer features of VHDL. (Which I do not recommend at all.)
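As a rough illustration of the software-FIFO idea, here is a small Python scoreboard of the kind one might place behind a simulation monitor. It is a sketch under stated assumptions: AXI4 without WID, a single in-order master, and hypothetical callback names (`on_aw`, `on_w`) that a testbench would call once per accepted handshake. It says nothing about how a formal tool would cope with the same structure.

```python
from collections import deque


class WriteScoreboard:
    """Pairs AW and W traffic in arrival order, whichever channel leads."""

    def __init__(self):
        self.aw_fifo = deque()   # AWLEN of each accepted AW, oldest first
        self.w_fifo = deque()    # WLAST of each accepted W beat, oldest first
        self.bursts_checked = 0
        self.errors = []

    def on_aw(self, awlen):
        self.aw_fifo.append(awlen)
        self._drain()

    def on_w(self, wlast):
        self.w_fifo.append(bool(wlast))
        self._drain()

    def _drain(self):
        # Check a burst as soon as its address and all of its beats are queued.
        while self.aw_fifo and len(self.w_fifo) >= self.aw_fifo[0] + 1:
            awlen = self.aw_fifo.popleft()
            seen = [self.w_fifo.popleft() for _ in range(awlen + 1)]
            expected = [beat == awlen for beat in range(awlen + 1)]
            if seen != expected:
                self.errors.append(
                    "burst %d: WLAST pattern %r, expected %r"
                    % (self.bursts_checked, seen, expected))
            self.bursts_checked += 1


sb = WriteScoreboard()
sb.on_w(False)   # W data arrives first ...
sb.on_w(True)
sb.on_aw(1)      # ... and the matching AW follows later
print(sb.bursts_checked, sb.errors)
```

The two queues absorb the skew between the channels, so the check itself only ever sees complete, resynchronized bursts.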

So I would say we could agree on the fact, whether intended or not, that the AXI specification keeps the FPGA engineers busy at work.

tristanitschner, May 31 '22 18:05

I forgot one thing: Another point of criticism of the AXI protocol is the use of separate ADDR channels. In most cases (from my experience) you either have a component that accepts data and outputs the processed data on a master AXI channel on its own, or the other way around. (If you're not using the stream protocol anyway.) And then there also is a register-mapped control interface, from which you either read or write, but almost never both at the same time. So what the inventor of the SpinalHDL language did, was to come up with an interconnect, which combines the two ADDR channels of the AXI protocol for every connection. He claims that this saves a lot of logic. This might be something of interest to you.

tristanitschner, May 31 '22 18:05

Verifying an IP where AW channel and W channel are not synchronous. This problem can also be solved by using FIFOs, which resynchronize the signals at the output.

I've had significant problems verifying IP components containing FIFOs. It's not that easy. While a FIFO is an easy thing to verify on its own, verifying something that uses a FIFO can be a particular challenge--and one that I'm not yet comfortable handling, even though I've had to do it (by now) for a couple of designs already.

A classic example: Verify that the number of beats in a FIFO with a given ID matches a known counter. That's on the easy end. A bit harder is to verify the number of LASTs in a FIFO (i.e. bursts), and then to match that to the number of beats in the FIFO. Now repeat that for a given ID. It can become a real challenge.

... the ASIC circuitry inside the DDR controller inside a Zynq MPSoC. Since DDR is organised in banks / pages, it benefits from knowing the address before the data and can fetch the associated page beforehand

This is a red herring. You can't get data to the DDR memory controller without first going through an address decoder to determine if the DDR memory is the correct slave to route the data to. Hence, when the data arrives at the DDR, the address must already be available.
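The routing argument can be modelled in a few lines of Python. This is only a toy: the address map, the slave names, and the `WriteRouter` class are invented for the example, and it handles one outstanding burst at a time. It shows that W beats arriving ahead of AW can only sit in a buffer until the address has been decoded.

```python
# Hypothetical address map: (slave name, base address, size in bytes).
ADDRESS_MAP = [
    ("ddr", 0x0000_0000, 0x4000_0000),
    ("regs", 0x4000_0000, 0x0001_0000),
]


def decode(addr):
    """Return the slave that owns addr, or None if nothing is mapped there."""
    for name, base, size in ADDRESS_MAP:
        if base <= addr < base + size:
            return name
    return None


class WriteRouter:
    """Toy interconnect port: one burst in flight, no error responses."""

    def __init__(self):
        self.pending_w = []    # beats that arrived before their address
        self.target = None     # slave selected by the current AW, if any
        self.delivered = []    # (slave, data) pairs actually forwarded

    def on_w(self, data, last):
        if self.target is None:
            self.pending_w.append((data, last))   # nowhere to send it yet
        else:
            self._deliver(data, last)

    def on_aw(self, awaddr):
        target = decode(awaddr)
        if target is None:
            raise ValueError("unmapped address (a real bus would return DECERR)")
        self.target = target
        # Flush buffered beats, stopping at the end of this burst.
        while self.pending_w and self.target is not None:
            self._deliver(*self.pending_w.pop(0))

    def _deliver(self, data, last):
        self.delivered.append((self.target, data))
        if last:
            self.target = None


router = WriteRouter()
router.on_w(0xAA, False)
router.on_w(0xBB, True)
print(router.delivered)    # still empty: no address, no route
router.on_aw(0x4000_0010)
print(router.delivered)
```

By the time any beat reaches a slave in this model, the address has necessarily been seen, which is the point being made about the DDR controller.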

So what the inventor of the SpinalHDL language did, was to come up with an interconnect, which combines the two ADDR channels of the AXI protocol for every connection. He claims that this saves a lot of logic.

This is an engineering decision involving tradeoffs. Many engineers have made different decisions here. I've chosen to keep the two channels separate for the time being. Xilinx (at one time) merged the two channels as well. But, like I said, there are tradeoffs involved in doing this. One such tradeoff is in logic. There's another tradeoff in terms of throughput and latency. Another tradeoff involves fanout. Don't forget, though, that the exclusive access protocol adds some additional requirements as well.

ZipCPU, Jun 01 '22 17:06

A classic example: Verify that the number of beats in a FIFO with a given ID matches a known counter. That's on the easy end. A bit harder is to verify the number of LASTs in a FIFO (i.e. bursts), and then to match that to the number of beats in the FIFO. Now repeat that for a given ID. It can become a real challenge.

Well, I would go about this as follows: From the LAST signal a FIRST signal can be easily recovered. (I do this all the time.) I assume by "inside the fifo" it is meant that the whole packet is inside the fifo. Then there are three options (corresponding somewhat to a crude fifo implementation):

  1. There is a transaction with LAST at the input -> counter + 1
  2. There is a transaction with FIRST at the output -> counter - 1
  3. Both happen at the same time -> counter doesn't change

Now when the ID is recorded alongside every beat, extending this to multiple IDs is really straightforward.
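A behavioural Python sketch of this scheme is given below. It is a software model only, not a formal property set, and the names (`recover_first`, `PacketCounter`, `step`) are invented for the example. The model assumes a packet is only drained once its LAST beat is stored, matching the "whole packet is inside the fifo" assumption above.

```python
from collections import deque, defaultdict


def recover_first(last_flags):
    """Derive FIRST from a stream of LAST flags: a beat is FIRST when the
    previous beat was LAST (or when it is the first beat after reset)."""
    first_flags = []
    prev_last = True
    for last in last_flags:
        first_flags.append(prev_last)
        prev_last = last
    return first_flags


class PacketCounter:
    """Per-ID count of complete packets waiting in a FIFO of (id, last) beats."""

    def __init__(self):
        self.fifo = deque()
        self.count = defaultdict(int)
        self.out_prev_last = defaultdict(lambda: True)  # FIRST recovery at output

    def step(self, push=None, pop=False):
        """Model one clock cycle; push is an (id, last) beat or None."""
        popped = None
        if pop and self.fifo:
            beat_id, last = self.fifo[0]
            first = self.out_prev_last[beat_id]
            # Start draining a packet only once its LAST beat is stored.
            if not first or self.count[beat_id] > 0:
                popped = self.fifo.popleft()
                if first:
                    self.count[beat_id] -= 1       # rule 2
                self.out_prev_last[beat_id] = last
        if push is not None:
            self.fifo.append(push)
            if push[1]:
                self.count[push[0]] += 1           # rule 1
        return popped


pc = PacketCounter()
pc.step(push=(0, False))
pc.step(push=(0, True))             # one complete two-beat packet stored
pc.step(push=(0, True), pop=True)   # rule 3: +1 and -1 in the same cycle
print(pc.count[0])
```

Keying both the counter and the FIRST recovery on the ID is what extends the scheme to multiple IDs; whether the same bookkeeping is comfortable to prove in a formal tool is a separate question.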

As a side note: I noticed that verifying a core, whether using formal methods or not, always comes very close to recreating the logic that is actually inside that core.

This is a red herring. You can't get data to the DDR memory controller without first going through an address decoder to determine if the DDR memory is the correct slave to route the data to. Hence, when the data arrives at the DDR, the address must already be available.

Now add to that the logic for cache coherency when using a cache controller.

tristanitschner, Jun 04 '22 22:06

I do this all the time

Do you formally verify your IP all the time? The problems listed above are specific to formal verification. They are not necessarily problems at all when using simulation--save that formal hits all corner cases in (typically) about 5 steps, whereas simulation can take millions of steps and still not hit the states with errors in them.

ZipCPU, Jun 05 '22 02:06

Given that "this is not an issue", I'm going to close this issue at this time.

ZipCPU, Nov 09 '22 21:11