Improve crossbar's interface to avoid bottleneck when accessing different banks from the same port.
Identified bottleneck:
While doing initial tests with VexRiscv SMP, a bottleneck on LiteDRAM's crossbar has been identified. The VexRiscv SMP cluster is directly connected to LiteDRAM through 2x 128-bit Instruction/Data native LiteDRAM ports:
- The CPUs of the cluster share the same LiteDRAM interface and will potentially want to access different banks of the DRAM.
- A port can currently only access one bank at a time and has to wait for the command to be emitted on the DRAM bus before switching to another bank (since the BankMachine locks the port).
The current BankMachine lock mechanism provides a simple way to avoid data buffering in the crossbar while also ensuring transaction ordering, but it is now limiting performance and should be improved.
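As a rough illustration of the cost (a toy Python model, not LiteDRAM code; the ISSUE_LATENCY value and the fully pipelined no-lock case are assumptions made only for this example):

```python
# Toy model (not LiteDRAM code) of how the per-port bank lock serializes
# accesses that alternate between two banks. ISSUE_LATENCY is an assumed
# number of cycles between a command being accepted by a BankMachine and
# being emitted on the DRAM bus.

ISSUE_LATENCY = 8     # assumed command-to-DRAM-bus latency, in cycles
N_ACCESSES    = 16    # accesses from one port, alternating between two banks

# With the lock: every bank switch stalls the port until the previous
# command has reached the DRAM bus, so each access pays the full latency.
cycles_with_lock = N_ACCESSES * ISSUE_LATENCY

# Without the lock: commands to different banks could be queued back-to-back
# and their latencies overlap, so only the first access pays the latency.
cycles_without_lock = ISSUE_LATENCY + N_ACCESSES

print(f"alternating-bank accesses with lock:    {cycles_with_lock} cycles")
print(f"alternating-bank accesses without lock: {cycles_without_lock} cycles")
```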
Reproducing the issue with VexRiscv SMP:
In https://github.com/enjoy-digital/litex_vexriscv_smp, apply the following patch to crt0.S:
boot_helper:
	li a0, 0x40000000       # a0 = main RAM base
	li s0, 0x800            # s0 = 2 KiB stride
	add a1, a0, s0          # a1..a4 = successive addresses, each 0x800 apart
	add a2, a1, s0
	add a3, a2, s0
	add a4, a3, s0
loop_me:                        # endless loop alternating stores to a0 and a1
	sw x0, (a0)
	sw x0, (a1)
	sw x0, (a0)
	sw x0, (a1)
	sw x0, (a0)
	sw x0, (a1)
	sw x0, (a0)
	sw x0, (a1)
	j loop_me
	sw x10, smp_lottery_args  , x14   # remaining original boot_helper code (unreachable after the patch)
	sw x11, smp_lottery_args+4, x14
Then run the simulation with traces enabled (--trace); the bottleneck can be observed by looking at the native LiteDRAM port between the VexRiscv SMP cluster and LiteDRAM.
Proposed solution:
To remove this bottleneck, the lock mechanism should probably be removed and other mechanisms introduced for writes and reads:
Write path:
For the write path, each port could maintain cmd_idx and pending_xfers values (with N, the maximum number of in-flight transfers, configurable) and, for each write (a behavioral sketch follows the list):
- Send the command to the BankMachine along with cmd_idx if pending_xfers < N, otherwise wait until this condition is satisfied.
- Store the write data in a data-width*N memory at the cmd_idx location.
- Increment cmd_idx (modulo N) and pending_xfers.
- Let the BankMachine retrieve the data from the cmd_idx that was passed to it, and decrement pending_xfers when the BankMachine accesses the data memory.
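A minimal behavioral sketch of this write-path bookkeeping (plain Python rather than Migen RTL; WritePort and send_cmd_to_bankmachine are hypothetical names, and the BankMachine side is reduced to a callback):

```python
class WritePort:
    def __init__(self, n):
        self.n = n                      # configurable number of in-flight writes (N)
        self.cmd_idx = 0                # index given to the BankMachine with each command
        self.pending_xfers = 0          # commands sent whose data has not yet been consumed
        self.data_mem = [None] * n      # data-width * N storage, indexed by cmd_idx

    def can_issue(self):
        return self.pending_xfers < self.n

    def write(self, data, send_cmd_to_bankmachine):
        # 1) Send the command with the current cmd_idx (only allowed if pending_xfers < N).
        if not self.can_issue():
            raise RuntimeError("port must wait: pending_xfers == N")
        send_cmd_to_bankmachine(self.cmd_idx)
        # 2) Store the write data at the cmd_idx location.
        self.data_mem[self.cmd_idx] = data
        # 3) Increment cmd_idx (modulo N) and pending_xfers.
        self.cmd_idx = (self.cmd_idx + 1) % self.n
        self.pending_xfers += 1

    def bankmachine_fetch(self, cmd_idx):
        # 4) The BankMachine retrieves the data for the cmd_idx it was given;
        #    the port decrements pending_xfers at that point.
        data = self.data_mem[cmd_idx]
        self.data_mem[cmd_idx] = None
        self.pending_xfers -= 1
        return data
```

Since the write data is buffered per cmd_idx, the port can keep issuing commands to other banks while earlier writes are still waiting for their BankMachine to pick up the data.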
Read path:
For the read path, each port could maintain cmd_idx, return_idx and pending_xfers values (with the same configurable N) and, for each read (a behavioral sketch follows the list):
- Send the command to the BankMachine along with cmd_idx if pending_xfers < N, otherwise wait until this condition is satisfied.
- Increment cmd_idx (modulo N) and pending_xfers.
- Let the BankMachine return the read data along with its cmd_idx; the data is written to the returned cmd_idx location.
- Return the read data to the port once the memory has valid data at the return_idx location. Once the data is presented and accepted, the return_idx memory location is invalidated, return_idx is incremented (modulo N) and pending_xfers is decremented.
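A matching behavioral sketch for the read path (same assumptions as the write-path sketch above; ReadPort is a hypothetical name, and bankmachine_return stands in for data coming back from a BankMachine):

```python
class ReadPort:
    def __init__(self, n):
        self.n = n                      # configurable number of in-flight reads (N)
        self.cmd_idx = 0                # index sent to the BankMachine with each command
        self.return_idx = 0             # next location the port expects to return in order
        self.pending_xfers = 0
        self.data_mem = [None] * n      # reordering buffer indexed by cmd_idx
        self.valid = [False] * n

    def read(self, send_cmd_to_bankmachine):
        # 1) Send the command with the current cmd_idx (only allowed if pending_xfers < N).
        if self.pending_xfers >= self.n:
            raise RuntimeError("port must wait: pending_xfers == N")
        send_cmd_to_bankmachine(self.cmd_idx)
        # 2) Increment cmd_idx (modulo N) and pending_xfers.
        self.cmd_idx = (self.cmd_idx + 1) % self.n
        self.pending_xfers += 1

    def bankmachine_return(self, cmd_idx, data):
        # 3) The BankMachine returns read data with its cmd_idx; store it at that location.
        self.data_mem[cmd_idx] = data
        self.valid[cmd_idx] = True

    def try_return_to_user(self):
        # 4) Present data in order: only when return_idx holds valid data.
        if not self.valid[self.return_idx]:
            return None
        data = self.data_mem[self.return_idx]
        # Once accepted: invalidate the location, increment return_idx (modulo N),
        # decrement pending_xfers.
        self.valid[self.return_idx] = False
        self.return_idx = (self.return_idx + 1) % self.n
        self.pending_xfers -= 1
        return data
```

The valid/return_idx pair is what restores ordering: responses from different BankMachines may come back out of order, but the port only presents data once the oldest outstanding read has completed.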
cc @jedrzejboczar, @dolu1990, @kgugala.
If I remember well, the lock was reducing the bandwidth by 75%. In practice, when 4 CPUs were doing busy work, it really hurt performance.