AXI4ToTL doesn't handle AXI4 ordering model
Type of issue: bug report | feature request
Impact: API modification
Development Phase: proposal
Other information: See the conversation from the Chipyard Google Group: https://groups.google.com/g/chipyard/c/dGCYZZB12is
If the current behavior is a bug, please provide the steps to reproduce the problem:
What is the current behavior?
Summary
The AXI4ToTL widget currently depends on being able to send only one TileLink transaction at a time. Effectively, it must be paired with a TLFIFOFixer, or it can generate illegal TileLink transactions (multiple outstanding transactions with the same source ID) and return illegally-ordered AXI4 responses. This starts as a performance issue (allowing only one in-flight transaction severely limits AXI4 bandwidth, especially when paired with a high-latency TileLink memory system), and becomes a logic bug as soon as you try to fix the performance issue.
Expected behavior
Right now, we depend on pairing the AXI4ToTL widget with an AXI4UserYanker, which defines the maxFlight parameter in AXI4ToTL: the maximum number of in-flight transactions supported for each AXI ID.
The "Operation Ordering" section of the TileLink spec says:
Within a TileLink network, there may be multiple outstanding operations inflight at any given time. These operations may be completed in any order.
Section 5.1 of the AXI4 spec says:
All transactions with a given AXI ID value must remain ordered, but there is no restriction on the ordering of transactions with different ID values.
These ordering semantics differ between TileLink and AXI4; in particular, AXI4 imposes ordering restrictions within a single AXI ID. When maxFlight is set to 1, there are no ordering issues: each AXI ID can only have a single transaction in flight at a time, and there are no ordering constraints between transactions with different AXI IDs, so the AXI4 ordering model matches TileLink's. However, once maxFlight exceeds 1, the AXI network depends on AXI4ToTL to maintain ordering of transactions within the same AXI ID, which is not the current behavior.
Current behavior (read requests)
Looking at the example of read requests, this is the logic AXI4ToTL uses to generate multiple TileLink requests for AXI requests on a single AXI ID:
// One transaction counter per AXI ID
val r_count = RegInit(Vec.fill(numIds) { UInt(0, width = txnCountBits) })
// TileLink source ID: Cat(AXI ID, low logFlight bits of that ID's counter, 0)
val r_id = if (maxFlight == 1) {
  Cat(in.ar.bits.id, UInt(0, width = 1))
} else {
  Cat(in.ar.bits.id, r_count(in.ar.bits.id)(logFlight - 1, 0), UInt(0, width = 1))
}
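To make the ID construction concrete, here's a small Python model of how the TileLink source ID is packed from the AXI ID, the per-ID counter, and the appended UInt(0, width=1) bit. (This is my own sketch, not code from AXI4ToTL; the function name is mine, and I'm just modeling the Cat expression above.)

```python
def tl_source_id(axi_id: int, count: int, log_flight: int) -> int:
    """Model of r_id: Cat(axi_id, count[log_flight-1:0], 0)."""
    count_bits = count & ((1 << log_flight) - 1)  # keep only the low log_flight bits
    return (axi_id << (log_flight + 1)) | (count_bits << 1)

# With log_flight = 1 (maxFlight = 2) and AXI ID 0:
print(tl_source_id(0, 0, 1))  # 0
print(tl_source_id(0, 1, 1))  # 2
print(tl_source_id(0, 2, 1))  # 0 -- the counter wraps, repeating the first source ID
```

Note that the counter wraps modulo maxFlight, which is exactly what sets up the reuse bug described below.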
So, r_count is a vector of counters, one counter for each supported AXI ID. Here's the code for how those counters are incremented:
// One-hot select of the counter belonging to the incoming request's AXI ID
val r_sel = UIntToOH(in.ar.bits.id, numIds)
(r_sel.asBools zip r_count) foreach { case (s, r) =>
  // Increment the selected counter whenever a read request is accepted
  when (in.ar.fire() && s) { r := r + UInt(1) }
}
As expected, whenever a request is sent, we increment the appropriate counter based on the AXI ID the request came from.
To see the problem with this scheme, consider this setup: we support only 1 AXI ID (numIds is 1) and two transactions in flight for that ID (maxFlight is 2). We receive two back-to-back AXI read requests on this single AXI ID, and AXI4ToTL happily issues two back-to-back TileLink Get requests, each with a unique r_id, since r_count is incremented after the first request. After the second request, r_count is incremented again, which wraps r_id back to the value used for the first Get. Now suppose the two responses come back out of order (say request 1 misses in a cache and request 2 hits), and a third AXI request is waiting to be sent. When we send request 3, we re-use the TileLink source ID of request 1, which is still in flight. This is illegal: all outstanding TileLink requests must be uniquely identifiable by their source IDs.
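To sanity-check this scenario, here's a throwaway Python model of the counter scheme with numIds = 1 and maxFlight = 2. (My own sketch; the names and structure are not from AXI4ToTL, it just mimics the r_count/r_id logic.)

```python
LOG_FLIGHT = 1  # maxFlight = 2, so each per-ID counter contributes 1 bit

def source_id(axi_id: int, count: int) -> int:
    # Model of r_id: Cat(axi_id, count[LOG_FLIGHT-1:0], 0)
    return (axi_id << (LOG_FLIGHT + 1)) | ((count & ((1 << LOG_FLIGHT) - 1)) << 1)

in_flight = set()  # TileLink source IDs still awaiting a response
count = 0          # the r_count entry for AXI ID 0

def issue():
    """Issue a Get on AXI ID 0; report whether its source ID is already in flight."""
    global count
    sid = source_id(0, count)
    collision = sid in in_flight  # illegal if True: outstanding source IDs must be unique
    in_flight.add(sid)
    count += 1
    return sid, collision

sid1, c1 = issue()       # request 1: source ID 0
sid2, c2 = issue()       # request 2: source ID 2
in_flight.discard(sid2)  # response 2 returns first (out of order)
sid3, c3 = issue()       # request 3: the counter wraps back to source ID 0
print(sid3 == sid1, c3)  # True True -- request 1's source ID reused while still in flight
```

The collision on request 3 is precisely what trips the TileLink Monitor assertions.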
So that's one problem with this system: we re-use TileLink source IDs when responses come back out of order, which trips assertions in the TileLink Monitors. There's a second problem, though. Suppose we fixed AXI4ToTL so that it only ever issues TileLink requests with source IDs that are not currently in flight, even after an out-of-order response. Forwarding that out-of-order response is still invalid from the AXI4 master's perspective, because the master expects all of its requests on a single AXI ID to produce well-ordered responses. Thus, AXI4ToTL would need re-ordering logic for all outstanding read requests within a single AXI ID, to convert out-of-order TileLink read responses into in-order AXI read responses.
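One way to picture the extra logic needed: a per-AXI-ID reorder buffer that holds out-of-order TileLink responses until all earlier responses for that ID have been delivered. Here's a behavioral Python sketch of that idea (my own design sketch under the assumptions above, not existing rocket-chip code):

```python
from collections import defaultdict

class ReorderBuffer:
    """Per-AXI-ID reordering: responses may arrive in any order (TileLink side),
    but are released strictly in issue order (AXI side)."""
    def __init__(self):
        self.next_issue = defaultdict(int)    # next sequence number to hand out, per AXI ID
        self.next_release = defaultdict(int)  # next sequence number the AXI master expects
        self.pending = defaultdict(dict)      # seq -> response data, per AXI ID

    def issue(self, axi_id):
        seq = self.next_issue[axi_id]
        self.next_issue[axi_id] += 1
        return seq  # tag carried with the TileLink request (e.g. in the source ID)

    def complete(self, axi_id, seq, data):
        """A TileLink response arrived; return the AXI responses now releasable, in order."""
        self.pending[axi_id][seq] = data
        released = []
        while self.next_release[axi_id] in self.pending[axi_id]:
            released.append(self.pending[axi_id].pop(self.next_release[axi_id]))
            self.next_release[axi_id] += 1
        return released

rob = ReorderBuffer()
a = rob.issue(0)
b = rob.issue(0)                     # two reads on AXI ID 0
out1 = rob.complete(0, b, "data2")
print(out1)                          # [] -- must wait for the older response
out2 = rob.complete(0, a, "data1")
print(out2)                          # ['data1', 'data2'] -- released in issue order
```

Note this buffers a full response per outstanding transaction, which is where the space-scaling concern in the resolution ideas below comes from.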
Current behavior (write requests)
The write request logic in AXI4ToTL is the same as the read request logic, so it will encounter the same TileLink-source-reuse bug as described above. However, I don't believe the other logical bug (relative ordering of transactions within a single AXI ID) will be an issue. This is what the AXI4 specification has to say about write transaction ordering within an AXI ID:
The data transfers for a sequence of write transactions with the same AWID value must complete in the order in which the master issued the addresses
I'm not sure what restrictions this places on how we issue TileLink Put requests, or on how we re-order (or don't re-order) AXI4 write responses. Do we have to re-order AXI4 write responses in case write requests complete out of order? Do we need to ensure the TileLink interconnect reflects those TileLink Put requests in order? I'm not sure what the correct behavior is for this case; I've mostly put my time into thinking about the read-request scenario.
How to replicate
It looks like most use-cases for AXI4ToTL implicitly insert a TLFIFOFixer, so this bug doesn't usually surface. However, here's an example of replicating it with the NVDLA: Periphery.scala for the NVDLA uses the fromMaster method to connect to the fbus, which implicitly inserts a TLFIFOFixer. If you instead use coupleFrom, the TLFIFOFixer disappears, and running an NVDLA workload will reveal the bug.
Ideas for resolution
Some of this is addressed in the previously-linked Chipyard Google Group discussion, but here's the summary:
- I've implemented a straightforward design which has a number of limitations (doesn't support multi-beat AXI4 bursts, read-response-reordering space scales poorly with outstanding transactions and AXI4 IDs).
- @zhemao gave some great ideas on how to improve the space scaling issues, and I've mostly implemented that design.
- I'm unsure of how to resolve the space-scaling issues when supporting multi-beat bursts. Perhaps that cost is unavoidable, but maybe we should give users the option to restrict the maximum AXI4 burst length (there's a mechanism for this in TileLink, but not really in AXI4).
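To see why multi-beat bursts make the reordering storage painful, here's a back-of-the-envelope calculation. The parameter values below are purely illustrative (not from any real config), and the worst-case model assumes every outstanding burst must be fully buffered for reordering:

```python
def reorder_bytes(num_ids, max_flight, max_burst_beats, beat_bytes):
    """Worst-case read-data storage if every in-flight transaction
    must be able to buffer one full burst for reordering."""
    return num_ids * max_flight * max_burst_beats * beat_bytes

# Illustrative: 16 AXI IDs, 4 in flight each, 256-beat INCR bursts, 8-byte beats
print(reorder_bytes(16, 4, 256, 8))  # 131072 bytes = 128 KiB of buffering
# Capping bursts at 4 beats shrinks the buffer 64x
print(reorder_bytes(16, 4, 4, 8))    # 2048 bytes
```

This is why a user-visible cap on burst length could make the reordering approach practical.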
Please tell us about your environment:
- Chipyard release 1.3
- Ubuntu 18.04
What is the use case for changing the behavior?
Improved AXI4 performance