Replacing axi_node_wrap_with_slices with axi_xbar in Ariane/CVA6

Open vsuresh95 opened this issue 3 years ago • 5 comments

Background: We are using the ESP platform to implement a heterogeneous system. We observed that the AXI crossbar at the output of the Ariane core in ESP was adding 3 extra cycles of delay, leading to a high L2 cache access latency. You can find the instantiation of the crossbar here: Link to ESP's ariane_wrap file

Further, we observed that the crossbar was not actually necessary because the second of its two slave ports was never used. Thus, we could replace the crossbar with a demultiplexer. We found the demux here: Link to the AXI demux

Questions:

  1. Is there a README or a reference implementation where this demux has been instantiated? We would like to know how to connect the interfaces and how to pass the address input as the select signal.
  2. What is the cycle latency between the slave and master ports? This information will help us understand the benefit over the current crossbar implementation.

Any other suggestions or feedback would be great!

vsuresh95 avatar May 18 '22 20:05 vsuresh95

If I'm not completely mistaken, the file you link currently instantiates the following module from a separate repository: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv

The axi_xbar module should automatically use the axi_demux module when only one slave port is declared. To configure the latency, set LatencyMode in the Cfg parameter. For more detail on how to use the modules, please refer to the doc folder, which has detailed descriptions of the implementation of both axi_xbar and axi_demux. For an implementation example, since you are using Ariane, I suggest having a look at the master branch of Ariane: https://github.com/openhwgroup/cva6/blob/75807530f26ba9a0ca501e9d3a6575ec375ed7ab/corev_apu/tb/ariane_testharness.sv#L477-L524.

micprog avatar May 19 '22 07:05 micprog

Thanks for the response! The reference implementation is very helpful! We will try to replace our current xbar with the one you pointed to, since the interface seems to be similar. We have one follow-up question:

  1. When we configure LatencyMode as NO_LATENCY, does that mean that the output is visible on the master port on the very next cycle after it arrives at the slave port? Or does it mean that the design adds no additional latency cycles on top of a default latency (say, 2 or 3 cycles)?

vsuresh95 avatar May 20 '22 17:05 vsuresh95

Should be combinational, i.e. no cycle latency at all, but please check the documentation for how to use the LatencyMode parameter.

micprog avatar May 20 '22 17:05 micprog

Thanks for the help! Upon further debugging, we found that the additional cycles were not in fact from the crossbar in our case. There were two-stage synchronization modules before and after the crossbar that were adding 4 cycles in total, as seen here: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv#L78-L106

We tried removing these entirely, but that caused bitstream synthesis to fail. We also tried reducing the number of stages to one (we don't really have a CDC at the crossbar). This time the synthesis worked but we're seeing instability while running Linux.

What is the purpose of these modules? Could there be an optimal implementation to reduce the delay?

vsuresh95 avatar May 26 '22 23:05 vsuresh95

> Thanks for the help! Upon further debugging, we found that the additional cycles were not in fact from the crossbar in our case. There were two-stage synchronization modules before and after the crossbar that were adding 4 cycles in total, as seen here: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv#L78-L106

Those modules are neither CDCs nor synchronizers; they are pipeline registers.
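
To illustrate why register stages show up as extra cycles, here is a toy cycle-level model in Python. It is only a sketch under loud assumptions: it models the forward path of a single token, ignores the AXI valid/ready handshake and the response channels, and treats the crossbar as purely combinational. The names `PipelineRegister` and `measure_latency` are made up for this example and do not exist in any of the repositories mentioned here.

```python
from collections import deque


class PipelineRegister:
    """Toy model of a register chain: each stage delays data by one clock cycle."""

    def __init__(self, stages):
        # Zero stages models a purely combinational pass-through.
        self.stages = deque([None] * stages, maxlen=stages) if stages > 0 else None

    def clock(self, value):
        """Advance one cycle: shift `value` in and return the value that falls out."""
        if self.stages is None:
            return value
        out = self.stages[0]          # oldest entry leaves the chain
        self.stages.append(value)     # maxlen drops the oldest entry automatically
        return out


def measure_latency(chain, max_cycles=32):
    """Return how many cycles a token injected at cycle 0 needs to emerge."""
    for cycle in range(max_cycles):
        value = "req" if cycle == 0 else None
        for block in chain:
            value = block.clock(value)
        if value == "req":
            return cycle
    raise RuntimeError("token never emerged")


# Two-stage slices before and after an (assumed) combinational crossbar.
two_stage = [PipelineRegister(2), PipelineRegister(0), PipelineRegister(2)]
# The same structure with the slices reduced to one stage each.
one_stage = [PipelineRegister(1), PipelineRegister(0), PipelineRegister(1)]

print(measure_latency(two_stage))  # 4
print(measure_latency(one_stage))  # 2
```

In this model the two-stage configuration reproduces the 4 cycles reported above. It says nothing about whether the design still meets timing with fewer stages.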

> We tried removing these entirely, but that caused bitstream synthesis to fail. We also tried reducing the number of stages to one (we don't really have a CDC at the crossbar). This time the synthesis worked but we're seeing instability while running Linux.

Are you still using axi_node_wrap_with_slices? If so, please switch to axi_xbar as explained by @micprog above. axi_node was deprecated several years ago.

If you are already using the modules in this repository and have questions, please be more specific. We cannot debug "bitstream synthesis fails" or "instability while running Linux" in an on-chip network repository.

> What is the purpose of these modules? Could there be an optimal implementation to reduce the delay?

Pipeline registers reduce the longest combinational path. The optimal configuration depends on the surrounding circuit.
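
The trade-off can be sketched numerically. The following Python snippet is a rough model, not a timing analysis: the per-block delays are invented numbers (they do not come from ESP, CVA6, or the axi repository), register setup and clock-to-output times are ignored, and `pipeline_tradeoff` is a hypothetical helper written for this example.

```python
def pipeline_tradeoff(segment_delays_ns, cut_points):
    """Split a chain of combinational blocks at the given register positions.

    segment_delays_ns: delays of consecutive logic blocks in ns (hypothetical values).
    cut_points: indices i meaning "a register sits directly after block i".
    Returns (added_latency_cycles, longest_path_ns, max_clock_mhz).
    """
    if not segment_delays_ns:
        raise ValueError("need at least one logic block")
    cuts = set(cut_points)
    longest = 0.0
    current = 0.0
    for i, delay in enumerate(segment_delays_ns):
        current += delay
        if i in cuts:
            # A register ends the current combinational path here.
            longest = max(longest, current)
            current = 0.0
    longest = max(longest, current)
    return len(cuts), longest, 1000.0 / longest


delays = [3.0, 4.0, 5.0]  # assumed delays of three logic blocks in ns

print(pipeline_tradeoff(delays, []))      # (0, 12.0, 83.33...): no added latency, slowest clock
print(pipeline_tradeoff(delays, [1]))     # (1, 7.0, 142.85...): one cycle, shorter path
print(pipeline_tradeoff(delays, [0, 1]))  # (2, 5.0, 200.0): two cycles, shortest path
```

Each register adds one cycle of latency but shortens the longest path. Which cut is worth its cycle depends on where the long paths in the surrounding circuit actually are, so there is no single optimal setting.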

andreaskurth avatar Jul 06 '22 08:07 andreaskurth