
Slow NVDLA performance, plus Linux kernel error messages

Open tymcauley opened this issue 4 years ago • 16 comments

Impact: software

What is the current behavior? When running the nvdla-small-resnet50.json and nvdla-large-resnet50.json workloads on the nv_small and nv_large FireSim configurations respectively, I'm getting some pretty bad performance numbers, as well as some Linux kernel error messages.

Specifically, the nv_large configuration reports an execution time of 457245 microseconds, or 2.18 frames/second. The nv_small configuration reports an execution time of 462657 microseconds, or 2.16 frames/second. This is assuming the default FireSim global CPU frequency of 3.2 GHz.

What is the expected behavior? NVIDIA gives some performance estimates here for ResNet-50 running on an NVDLA clocked at 1 GHz (3.2 times slower than the FireSim model): http://nvdla.org/primer.html#example-area-and-performance-with-nvdla

They range from 269 frames/sec for a 2048-MAC design (like the nv_large) to 7.3 frames/sec for a 64-MAC design (like the nv_small). I'm not sure how accurate these estimates are, or what assumptions they've made about the memory system, but I would expect to see some rough correlation with those numbers.

Please tell us about your environment:

- version: 9c73037e93be4dbea93ee9738ef9fb530287a226
- os: Linux ip-192-168-0-168.ec2.internal 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- other: Ran through the Chipyard and FireSim setup steps, then built AGFIs for these NVDLA FireSim configurations, then swapped to the nvdla-linux-v57 branch of riscv-linux before building the workloads.

Other information Before running this test, I commented out the bodies of these four functions so that the NVDLA's performance wouldn't be swamped with printing. Here's the relevant uartlog section for the nv_large test:

+ ./nvdla_runtime --loadable resnet50_large.nvdla --image dog.jpg
creating new runtime context...
Emulator starting
dlaimg height: 224 x 224 x 3: LS: 7168 SS: 0 Size: 1605632
submitting tasks...
[    0.586978] BUG: spinlock bad magic on CPU#0, swapper/0/0
[    0.586988]  lock: 0xffffffe3eff90328, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.586999] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      5.7.0-rc3-58540-g66e8cf3c569c #8
[    0.587008] Call Trace:
[    0.587077] [<ffffffe00020243a>] walk_stackframe+0x0/0xaa
[    0.587245] [<ffffffe000202626>] show_stack+0x2a/0x34
[    0.587403] [<ffffffe0004864da>] dump_stack+0x6e/0x88
[    0.587559] [<ffffffe000243b40>] spin_dump+0x68/0x74
[    0.587714] [<ffffffe00024374a>] do_raw_spin_lock+0xb0/0xcc
[    0.587888] [<ffffffe000873fa8>] _raw_spin_lock_irqsave+0x20/0x2c
[    0.588206] [<ffffffdf81150294>] nvdla_engine_isr+0x20/0x58 [opendla]
[    0.588280] [<ffffffe0002480a2>] __handle_irq_event_percpu+0x4c/0xe4
[    0.588477] [<ffffffe000248154>] handle_irq_event_percpu+0x1a/0x5c
[    0.588670] [<ffffffe0002481ce>] handle_irq_event+0x38/0x64
[    0.588844] [<ffffffe00024b5ae>] handle_fasteoi_irq+0xae/0x13e
[    0.589026] [<ffffffe0002473ca>] generic_handle_irq+0x28/0x36
[    0.589205] [<ffffffe00049b904>] plic_handle_irq+0x8e/0xd4
[    0.589376] [<ffffffe000874440>] do_IRQ+0x60/0xb8
[    0.589524] [<ffffffe0002011b2>] ret_from_exception+0x0/0xc
[    0.589709] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[    0.589960] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 244, name: nvdla_runtime
[    0.590221] CPU: 2 PID: 244 Comm: nvdla_runtime Tainted: G           O      5.7.0-rc3-58540-g66e8cf3c569c #8
[    0.590526] Call Trace:
[    0.590604] [<ffffffe00020243a>] walk_stackframe+0x0/0xaa
[    0.590772] [<ffffffe000202626>] show_stack+0x2a/0x34
[    0.590930] [<ffffffe0004864da>] dump_stack+0x6e/0x88
[    0.591088] [<ffffffe000223caa>] ___might_sleep+0xd4/0xe4
[    0.591256] [<ffffffe000223cf6>] __might_sleep+0x3c/0x66
[    0.591422] [<ffffffe00087184c>] mutex_lock+0x22/0x44
[    0.591579] [<ffffffe00064a2ba>] dma_buf_vmap+0x26/0x9c
[    0.591840] [<ffffffdf811505c8>] dla_data_read+0x72/0xe4 [opendla]
[    0.592022] [<ffffffdf8114b5f0>] dla_submit_operation+0x150/0x2e4 [opendla]
[    0.592238] [<ffffffdf8114b820>] dla_dequeue_operation+0x9c/0x128 [opendla]
[    0.592454] [<ffffffdf8114bb12>] dla_op_completion+0x220/0x2a4 [opendla]
[    0.592664] [<ffffffdf8114bd28>] dla_process_events+0x192/0x1d0 [opendla]
[    0.592875] [<ffffffdf8115069a>] nvdla_task_submit+0x60/0xa4 [opendla]
[    0.593079] [<ffffffdf81150a0c>] nvdla_submit+0xa4/0xf8 [opendla]
[    0.593185] [<ffffffe00051b83a>] drm_ioctl_kernel+0x6e/0xaa
[    0.593358] [<ffffffe00051bb54>] drm_ioctl+0x184/0x286
[    0.593518] [<ffffffe0003175ae>] ksys_ioctl+0x144/0x654
[    0.593681] [<ffffffe000317ace>] sys_ioctl+0x10/0x18
[    0.593836] [<ffffffe0002011a4>] ret_from_syscall+0x0/0x2
Work Found!
Work Done
execution time = 457245.000000 s
Shutdown signal received, exiting

And here's the relevant uartlog section for the nv_small test:

+ ./nvdla_runtime --loadable resnet50_small.nvdla --image dog.jpg
creating new runtime context...
Emulator starting
dlaimg height: 224 x 224 x 3: LS: 1792 SS: 0 Size: 401408
submitting tasks...
[    0.586561] BUG: spinlock bad magic on CPU#0, swapper/0/0
[    0.586570]  lock: 0xffffffe3eeee1da8, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.586581] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      5.7.0-rc3-58540-g66e8cf3c569c #9
[    0.586589] Call Trace:
[    0.586659] [<ffffffe00020243a>] walk_stackframe+0x0/0xaa
[    0.586827] [<ffffffe000202626>] show_stack+0x2a/0x34
[    0.586985] [<ffffffe0004864da>] dump_stack+0x6e/0x88
[    0.587142] [<ffffffe000243b40>] spin_dump+0x68/0x74
[    0.587296] [<ffffffe00024374a>] do_raw_spin_lock+0xb0/0xcc
[    0.587471] [<ffffffe000873fa8>] _raw_spin_lock_irqsave+0x20/0x2c
[    0.587789] [<ffffffdf81150294>] nvdla_engine_isr+0x20/0x58 [opendla]
[    0.587862] [<ffffffe0002480a2>] __handle_irq_event_percpu+0x4c/0xe4
[    0.588060] [<ffffffe000248154>] handle_irq_event_percpu+0x1a/0x5c
[    0.588253] [<ffffffe0002481ce>] handle_irq_event+0x38/0x64
[    0.588426] [<ffffffe00024b5ae>] handle_fasteoi_irq+0xae/0x13e
[    0.588608] [<ffffffe0002473ca>] generic_handle_irq+0x28/0x36
[    0.588788] [<ffffffe00049b904>] plic_handle_irq+0x8e/0xd4
[    0.588959] [<ffffffe000874440>] do_IRQ+0x60/0xb8
[    0.589106] [<ffffffe0002011b2>] ret_from_exception+0x0/0xc
[    0.589291] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[    0.589542] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 244, name: nvdla_runtime
[    0.589803] CPU: 3 PID: 244 Comm: nvdla_runtime Tainted: G           O      5.7.0-rc3-58540-g66e8cf3c569c #9
[    0.590109] Call Trace:
[    0.590187] [<ffffffe00020243a>] walk_stackframe+0x0/0xaa
[    0.590354] [<ffffffe000202626>] show_stack+0x2a/0x34
[    0.590513] [<ffffffe0004864da>] dump_stack+0x6e/0x88
[    0.590671] [<ffffffe000223caa>] ___might_sleep+0xd4/0xe4
[    0.590838] [<ffffffe000223cf6>] __might_sleep+0x3c/0x66
[    0.591006] [<ffffffe00087184c>] mutex_lock+0x22/0x44
[    0.591162] [<ffffffe00064a2ba>] dma_buf_vmap+0x26/0x9c
[    0.591422] [<ffffffdf811505c8>] dla_data_read+0x72/0xe4 [opendla]
[    0.591604] [<ffffffdf8114b5f0>] dla_submit_operation+0x150/0x2e4 [opendla]
[    0.591820] [<ffffffdf8114b820>] dla_dequeue_operation+0x9c/0x128 [opendla]
[    0.592037] [<ffffffdf8114bb12>] dla_op_completion+0x220/0x2a4 [opendla]
[    0.592246] [<ffffffdf8114bd28>] dla_process_events+0x192/0x1d0 [opendla]
[    0.592458] [<ffffffdf8115069a>] nvdla_task_submit+0x60/0xa4 [opendla]
[    0.592661] [<ffffffdf81150a0c>] nvdla_submit+0xa4/0xf8 [opendla]
[    0.592767] [<ffffffe00051b83a>] drm_ioctl_kernel+0x6e/0xaa
[    0.592940] [<ffffffe00051bb54>] drm_ioctl+0x184/0x286
[    0.593101] [<ffffffe0003175ae>] ksys_ioctl+0x144/0x654
[    0.593263] [<ffffffe000317ace>] sys_ioctl+0x10/0x18
[    0.593419] [<ffffffe0002011a4>] ret_from_syscall+0x0/0x2
Work Found!
Work Done
execution time = 462657.000000 s
Shutdown signal received, exiting

Note that these kernel error messages appear whether or not I have those dla_* functions working. I'm not sure if these error messages are related to the performance issues, but figured they were worth mentioning.

Also note that the "execution time" message should be interpreted as a duration in microseconds, not seconds. The relevant code for calculating that duration can be found here.
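
For reference, that kind of duration calculation typically looks like the sketch below (plain C, assuming clock_gettime-style timestamps; this is an illustration, not necessarily the exact code linked above). The computed value is already in microseconds, so the trailing "s" in the uartlog is just a mislabeled unit:

#include <stdio.h>
#include <time.h>

/* Elapsed time between two timestamps, in microseconds. */
static double elapsed_us(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec  - start->tv_sec)  * 1000000.0 +
           (end->tv_nsec - start->tv_nsec) / 1000.0;
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... run the NVDLA workload ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Prints a microsecond value even though the label says "s". */
    printf("execution time = %f s\n", elapsed_us(&start, &end));
    return 0;
}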

tymcauley avatar May 26 '20 20:05 tymcauley

Yes, there seem to be two issues here.

  1. Linux kernel error messages: I think this is a result of the recent bump to Linux 5.7. As you can see from https://github.com/ucb-bar/nvdla-sw/commit/eab38cdc3a9f0c1badc822da9a26c5a2274c62d4, it looks like I only made one change; however, the kernel actually removed the dma_<declare/release>_coherent_memory functions from use by external drivers (so I re-added them until NVDLA SW catches up). As for why that errors, I don't know. I'll try to see if something sticks out.

  2. I just checked my latest results, and it looks like we are seeing roughly the same numbers. Another way to diagnose this is to check whether Farzad's prior repo shows the same performance difference. If it does, then I think you can speed things up by connecting both the configuration and memory ports directly to the sbus. BOOM might also help speed up the softmax that is emulated on the core side, since that is a large majority of the runtime. I'll try building a BOOM NVDLA with my recommended changes.

abejgonzalez avatar May 26 '20 22:05 abejgonzalez

Oh wow, you're right. I figured that the emulator runtime would be really quick, since it's just running softmax on 1000 entries. I made this change to record the execution time breakdown:

diff --git a/umd/core/src/runtime/Runtime.cpp b/umd/core/src/runtime/Runtime.cpp
index d8640e6..9811ebd 100644
--- a/umd/core/src/runtime/Runtime.cpp
+++ b/umd/core/src/runtime/Runtime.cpp
@@ -32,6 +32,8 @@
 #include <map>
 #include <list>

+#include <time.h>
+
 #include "dlatypes.h"
 #include "dlaerror.h"

@@ -584,6 +586,13 @@ bool Runtime::fillEMUTaskAddressList(Task *task, EMUTaskDescAccessor taskDescAcc
     return true;
 }

+double get_elapsed_time(struct timespec *before, struct timespec *after)
+{
+  double deltat_s  = (after->tv_sec - before->tv_sec) * 1000000.0;
+  double deltat_ns = (after->tv_nsec - before->tv_nsec) / 1000.0;
+  return deltat_s + deltat_ns;
+}
+
 bool Runtime::submit()
 {
     NvDlaError e = NvDlaSuccess;
@@ -595,6 +604,7 @@ NvDlaError Runtime::submitInternal()
 {
     NvDlaError e = NvDlaSuccess;
     Task *task;
+    struct timespec before, after;

     vector<EMUTaskDescAccessor*> emu_task_descs;
     size_t ii;
@@ -666,7 +676,10 @@ NvDlaError Runtime::submitInternal()

                     fillTaskAddressList(task, &dla_task);

+                    clock_gettime(CLOCK_MONOTONIC, &before);
                     PROPAGATE_ERROR_FAIL( NvDlaSubmit(NULL, dev, &dla_task, 1) );
+                    clock_gettime(CLOCK_MONOTONIC, &after);
+                    gLogInfo << "DLA execution time = " << get_elapsed_time(&before, &after) << " us" << endl;
                 }
                 break;
                 case ILoadable::Interface_EMU1:
@@ -695,7 +708,10 @@ NvDlaError Runtime::submitInternal()

                     fillEMUTaskAddressList(task, *(emu_task_descs.back()));

+                    clock_gettime(CLOCK_MONOTONIC, &before);
                     PROPAGATE_ERROR_FAIL( m_emu_engine->submit(task_mem, 1) );
+                    clock_gettime(CLOCK_MONOTONIC, &after);
+                    gLogInfo << "Emulator execution time = " << get_elapsed_time(&before, &after) << " us" << endl;

                     emu_instance = (emu_instance + 1) % num_emu_instances;

This is what I get for the nv_large:

...
libnvdla<3> DLA execution time = 18772 us
Work Found!
Work Done
libnvdla<3> Emulator execution time = 438073 us
execution time = 457079.000000 s
...

And this is for the nv_small:

...
libnvdla<3> DLA execution time = 44576 us
Work Found!
Work Done
libnvdla<3> Emulator execution time = 417864 us
execution time = 462670.000000 s
...

That DLA breakdown of 18.8 ms vs 44.6 ms for nv_large vs nv_small seems much more believable, although that isn't exactly inspiring performance (53.3 frames/sec and 22.4 frames/sec respectively, ignoring the emulator runtime), especially since the whole DUT is running at 3.2 GHz.

tymcauley avatar May 26 '20 23:05 tymcauley

Yup. I discovered this when doing my class project using NVDLA and was super surprised (measuring in a similar fashion but with cycle counts). Gemmini (UCB's DL accel.) doesn't do a softmax at the end so it saves on that compute.

I'm spinning up a BOOM + NVDLA AFI right now, so hopefully I can get Linux answers soon.

abejgonzalez avatar May 26 '20 23:05 abejgonzalez

I just put some of the same clock_gettime code in Emulator.cpp to see how long running softmax takes, and in Emulator::run() in between m_taskQueue.front() and m_taskQueue.pop(), only ~300-500 us elapse. Perhaps the thread context switching is extremely slow? I might try moving the emulator into the main thread and see if that solves things, but I'm going to have to wait on that until tomorrow.

tymcauley avatar May 27 '20 00:05 tymcauley

Well, it turns out that was simpler than I thought, and it did the trick. So far I've only tested on the nv_large, but the total execution time was between 16 ms and 23 ms. Want me to submit a PR to ucb-bar/nvdla-sw with the change?

tymcauley avatar May 27 '20 16:05 tymcauley

Although, if I had to guess, I'd say this is the performance culprit for the multi-threaded scenario: https://github.com/ucb-bar/nvdla-sw/blob/eab38cdc3a9f0c1badc822da9a26c5a2274c62d4/umd/core/src/runtime/Emulator.cpp#L157-L160
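
The kind of pattern that would explain the gap is a worker thread that polls its task queue with a coarse sleep, so a single submitted task waits out most of one polling period before it even gets picked up. A purely hypothetical sketch of that pattern in C (this is not the actual Emulator.cpp code, just an illustration of the guess):

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool work_pending = false;    /* stands in for the emulator task queue */

static void *emulator_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        bool found = work_pending;
        work_pending = false;
        pthread_mutex_unlock(&queue_lock);

        if (found) {
            /* Run softmax: only ~300-500 us per the measurement above. */
        } else {
            /* A coarse polling interval like this, rather than the softmax
             * itself, would dominate the latency of a single task. */
            usleep(100000);
        }
    }
    return NULL;
}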

tymcauley avatar May 27 '20 16:05 tymcauley

I'll take a look at the PR today. I'm also going to bump a bit of the SW to clean up prints later today / this week to complement this.

abejgonzalez avatar May 27 '20 17:05 abejgonzalez

~~As an update, I think the 2nd exception is mainly due to the 1st. The 1st sets up an atomic section then errors (without exiting the atomic section), thus when the mutex is accessed in the 2nd section it is still in the atomic section and is invalid. Though at the moment I'm not exactly sure why the 1st part errors.~~

Nevermind, I have the solution for the 1st error. Now for the 2nd (doesn't seem like they are linked), it looks like you can't have the mutex_lock (within the dma_buf_vmap function) happen within a critical section (set up by the nvdla_task_submit function).

I'll try to look into this tomorrow since I now have to learn about Linux mutexes in more depth.
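
As an aside on the 1st error: "spinlock bad magic" with .magic: 00000000 under CONFIG_DEBUG_SPINLOCK generally means the lock was used before it was ever initialized. A minimal kernel-C illustration of that failure mode and the usual fix (my reading of the symptom, not necessarily the exact opendla change):

#include <linux/spinlock.h>

struct engine_state {
        spinlock_t lock;
        /* ... */
};

/* The ISR can fire as soon as the IRQ is requested, so the lock must be
 * initialized before that point; taking a zeroed, uninitialized spinlock is
 * what CONFIG_DEBUG_SPINLOCK reports as "bad magic". */
static void engine_state_init(struct engine_state *s)
{
        spin_lock_init(&s->lock);
}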

abejgonzalez avatar May 27 '20 23:05 abejgonzalez

@abejgonzalez: Sleeping while holding a spinlock is a recipe for deadlock, especially when kernel preemption has been locally disabled by disabling interrupts. This is why mutex_lock() is unsafe to use in atomic context, since the blocked task is potentially put to sleep if it cannot acquire the mutex.
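
To make that concrete, the pattern the two traces above point to looks roughly like this (an illustrative kernel-C sketch, not the actual opendla code):

#include <linux/spinlock.h>
#include <linux/mutex.h>

static DEFINE_SPINLOCK(engine_lock);
static DEFINE_MUTEX(buf_mutex);

static void bad_path(void)
{
        unsigned long flags;

        /* Enters atomic context: interrupts off, preemption disabled. */
        spin_lock_irqsave(&engine_lock, flags);

        /* mutex_lock() may sleep, which is illegal in atomic context; with
         * CONFIG_DEBUG_ATOMIC_SLEEP enabled this is exactly what produces
         * "BUG: sleeping function called from invalid context". In the traces
         * above the sleep comes from dma_buf_vmap(), which takes a mutex
         * internally. */
        mutex_lock(&buf_mutex);
        mutex_unlock(&buf_mutex);

        spin_unlock_irqrestore(&engine_lock, flags);
}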

a0u avatar May 28 '20 05:05 a0u

Sure. I'm guessing that the dma_buf_vmap function was updated between 5.3 and 5.7, which may now cause the problem. But I would have to look into it more.

abejgonzalez avatar May 28 '20 07:05 abejgonzalez

OK, upon further inspection, the 2nd bug (BUG: sleeping function...) seems to be a problem with how the driver is written (as far as I can tell, dma_buf_vmap can end up being called from an atomic section). This issue was present before the kernel bump (from 5.3 to 5.7-rc3) but was masked because the CONFIG parameters below were previously unset (in the most recent kernel they default to y). Note: these kernel configs also masked the 1st bug.

CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

If you want to mask out the warnings, feel free to unset those flags in the nvdla-kfrag (a config fragment can disable them with lines like "# CONFIG_DEBUG_SPINLOCK is not set"). I don't know what the right solution is (use something other than dma_buf_vmap that doesn't take mutexes/sleep?). For now I think I will post an issue on the NVDLA workload repository to track the problem and aim to solve it by the next release (1.4+). I don't think I will find a solution before the 1.3 release (especially since this bug was already present earlier).

abejgonzalez avatar May 29 '20 20:05 abejgonzalez

Thanks for investigating that @abejgonzalez! I agree, no need to get this fixed ASAP for the 1.3 release.

tymcauley avatar May 29 '20 23:05 tymcauley

Alright, now that we've got the inference latency down from half a second to tens of milliseconds, I wanted to try and figure out what the remaining performance issues are. TLDR: I think they might be in the AXI <-> TileLink converter stack on the NVDLA's DBB port (the main DMA port). Caveat: I am not an expert in how those diplomatic widgets work.

Here's what the AXI <-> TL conversion looks like for the NVDLA's DBB port:

(dbb_tl_node
  := TLBuffer()
  := TLWidthWidget(dataWidthAXI/8)
  := AXI4ToTL()
  := AXI4UserYanker(capMaxFlight=Some(16))
  := AXI4Fragmenter()
  := AXI4IdIndexer(idBits=3)
  := AXI4Buffer()
  := dbb_axi_node)

So, I did some printf synthesis on the AXI and TileLink sides of this port to see what the transactions look like when running ResNet-50 in FireSim. Here are a few excerpts of the results (you might notice in these logs that my fbus and sbus are 256-bits, but other than that, this is a vanilla 4-Rocket-core-with-inclusive-L2-cache system).

Here is a sample of the very first layer starting its read-DMA transactions:

[4238063775] CFG: TL.A { opcode: PutFullData, param: 0, size: 2 (4 bytes), source: 16800, address: 0x10048008, mask: 0xf, data: 0x1, corrupt: false }
[4238063776] CFG: TL.D { opcode: AccessAck, size: 2 (4 bytes), source: 16800, data: 0x0 }
[4238063796] DBB: AXI.AR { id: 5, addr: 0xc4004000, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063797] DBB: AXI.AR { id: 5, addr: 0xc4004008, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063798] DBB: AXI.AR { id: 5, addr: 0xc4004010, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063798] DBB: TL.A { opcode: Get, param: 0, size: 3 (8 bytes), source: 180, address: 0xc4004000, mask: 0xff, data: 0x0, corrupt: false }
[4238063799] DBB: AXI.AR { id: 5, addr: 0xc4004018, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063800] DBB: AXI.AR { id: 5, addr: 0xc4004020, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063801] DBB: AXI.AR { id: 5, addr: 0xc4004028, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063845] DBB: TL.D { opcode: AccessAckData, size: 3 (8 bytes), source: 180, data: 0xddc4ed16ec181aa6 }
[4238063846] DBB: TL.A { opcode: Get, param: 0, size: 3 (8 bytes), source: 182, address: 0xc4004008, mask: 0xff00, data: 0x0, corrupt: false }
[4238063847] DBB: AXI.R { id: 5, data: 0xddc4ed16ec181aa6, resp: Okay, last: true }
[4238063849] DBB: AXI.AR { id: 5, addr: 0xc4004030, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238063878] DBB: TL.D { opcode: AccessAckData, size: 3 (8 bytes), source: 182, data: 0xee68ebc4253917f70000000000000000 }
[4238063879] DBB: TL.A { opcode: Get, param: 0, size: 3 (8 bytes), source: 184, address: 0xc4004010, mask: 0xff0000, data: 0x0, corrupt: false }
[4238063880] DBB: AXI.R { id: 5, data: 0xee68ebc4253917f7, resp: Okay, last: true }
[4238063882] DBB: AXI.AR { id: 5, addr: 0xc4004038, len: 0, size: 3 (8 bytes), burst: Fixed }
...

Those first two lines show the NVDLA receiving the start-your-read-DMA command for the SDP unit (the PutFullData write of 0x1 to 0x10048008 and its AccessAck).

The NVDLA's DBB port then issues 6 back-to-back read requests (the ARs at t = 4238063796 - 4238063801), while the TileLink side of the converter sends out a single Get request for the first AR at t = 4238063798. We then wait about 45 cycles for the TileLink AccessAckData response, after which the TileLink logic issues the Get for the second of those 6 ARs. Once that AccessAckData passes through the two buffer stages and becomes an R response at the NVDLA's DBB port, the NVDLA issues a seventh AR request.

Note that all of the NVDLA's requests are for sequential memory addresses, but they all use the same id. An AXI interconnect should be able to have all of these transactions in flight at the same time, but the master expects all of the responses to return in order. Is there any way that we can get these TileLink transactions in flight simultaneously?

Here's what things look like when the NVDLA starts issuing writes, and it's quite similar to the read timing:

[4238233899] DBB: AXI.W { data: 0x10018000a00002a, strb: 0xff, last: true }
[4238233899] DBB: AXI.AW { id: 1, addr: 0xc1000000, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233901] DBB: TL.A { opcode: PutPartialData, param: 0, size: 3 (8 bytes), source: 59, address: 0xc1000000, mask: 0xff, data: 0x10018000a00002a010018000a00002a010018000a00002a010018000a00002a, corrupt: false }
[4238233908] DBB: AXI.W { data: 0x40018000002000a, strb: 0xff, last: true }
[4238233908] DBB: AXI.AW { id: 1, addr: 0xc1000008, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233917] DBB: AXI.W { data: 0x70018000003000d, strb: 0xff, last: true }
[4238233917] DBB: AXI.AW { id: 1, addr: 0xc1000010, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233926] DBB: AXI.W { data: 0x60018000003000d, strb: 0xff, last: true }
[4238233926] DBB: AXI.AW { id: 1, addr: 0xc1000018, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233935] DBB: AXI.W { data: 0x60018000003000e, strb: 0xff, last: true }
[4238233935] DBB: AXI.AW { id: 1, addr: 0xc1000020, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233944] DBB: AXI.W { data: 0x70018000003000d, strb: 0xff, last: true }
[4238233944] DBB: AXI.AW { id: 1, addr: 0xc1000028, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233954] DBB: TL.D { opcode: AccessAck, size: 3 (8 bytes), source: 59, data: 0x0 }
[4238233955] DBB: TL.A { opcode: PutPartialData, param: 0, size: 3 (8 bytes), source: 61, address: 0xc1000008, mask: 0xff00, data: 0x40018000002000a040018000002000a040018000002000a040018000002000a, corrupt: false }
[4238233956] DBB: AXI.B { id: 1, resp: Okay }
[4238233958] DBB: AXI.W { data: 0x60018000003000d, strb: 0xff, last: true }
[4238233958] DBB: AXI.AW { id: 1, addr: 0xc1000030, len: 0, size: 3 (8 bytes), burst: Fixed }
[4238233962] DBB: TL.D { opcode: AccessAck, size: 3 (8 bytes), source: 61, data: 0x0 }
[4238233963] DBB: TL.A { opcode: PutPartialData, param: 0, size: 3 (8 bytes), source: 63, address: 0xc1000010, mask: 0xff0000, data: 0x70018000003000d070018000003000d070018000003000d070018000003000d, corrupt: false }
[4238233964] DBB: AXI.B { id: 1, resp: Okay }
...

Again we see many AXI AW requests with the same id, but only a single TileLink PutPartialData request in flight at a time.

Any ideas on improving performance? It seems like TileLink should allow us to have multiple of these AXI requests in flight at once, and then some reordering logic will be required once the responses arrive. Is that logic implemented in the TileLink converter widgets?

Edit: Also, I'm more than happy to file this in a new issue (or start a thread on the Chipyard Google Group); I realize this is getting into a totally separate topic from the discussion above.

tymcauley avatar Jun 02 '20 01:06 tymcauley

Well, looks like I spoke too soon on the root cause of this problem. These TileLink transactions aren't being delayed because the TL <-> AXI conversion logic isn't asserting a.valid, but rather because the FrontBus is deasserting a.ready (see the waveforms below for more information). I tried adjusting the NVDLAFrontBusExtraBuffers parameter, but that didn't make any difference in performance.

It looks like the root cause of this is the TLFIFOFixer that gets inserted between the NVDLA's TileLink DBB port and the FrontBus crossbar. That widget shows up as a result of the fromMaster method used in the NVDLA's Periphery.scala. If we instead connect the DBB port to the FrontBus using the coupleFrom method, the TLFIFOFixer is no longer inserted:

-    fbus.fromMaster(name = Some("nvdla_dbb"), buffer = BufferParams.default) {
-      TLBuffer.chainNode(p(NVDLAFrontBusExtraBuffers))
-    } := nvdla.dbb_tl_node
+    fbus.coupleFrom("nvdla_dbb") { bus =>
+      (bus
+        := TLBuffer(BufferParams(p(NVDLAFrontBusExtraBuffers)))
+        := nvdla.dbb_tl_node)}

Referring to this documentation, it seems like the TLFIFOFixer shouldn't be necessary in this context, since the NVDLA's TileLink client only cares about responses from a single TileLink manager, which is the L2 coherency agent. Please correct me if I'm incorrectly interpreting the purpose of the TLFIFOFixer!

I haven't tested this change in FireSim yet to see if the ResNet results improve (or if we still get the same output).

Waveforms

Here's a waveform showing the existing implementation of the NVDLA's DBB port (both the AXI and TL sides), issuing a series of read requests. You can see that the tl.a.ready signal is deasserted after each tl.a request fires. I dug into where that deassertion was coming from, and it was within the TLFIFOFixer.

(Waveform: original-nv-small-performance)

Here's a waveform showing the same simulation, but after making the change mentioned above (moving from fromMaster to coupleFrom). You can see that many back-to-back tl.a requests are getting out before tl.a.ready deasserts. At this point performance probably needs to be optimized through the addition of TLBuffers, but this is a huge improvement over the current design.

(Waveform: no-tlfifofixer-nv-small-performance)

tymcauley avatar Jun 04 '20 16:06 tymcauley

Turns out the interaction between the various AXI widgets and the TLFIFOFixer is pretty subtle, so I'm going to move that discussion to this thread in the Chipyard Google Group: https://groups.google.com/g/chipyard/c/dGCYZZB12is

tymcauley avatar Jul 03 '20 17:07 tymcauley

Hello @abejgonzalez, I encountered a similar issue while starting Linux with NVDLA. Can you help me take a look? I explained the specific situation in https://github.com/ucb-bar/chipyard/issues/1558#issuecomment-1639242635

Yuxin-Yu avatar Jul 18 '23 03:07 Yuxin-Yu