Replacing axi_node_wrap_with_slices with axi_xbar in Ariane/CVA6
Background: We are using the ESP platform to implement a heterogeneous system. We observed that the AXI crossbar instantiated at the output of the Ariane core in ESP adds 3 extra cycles of delay, leading to a high L2 cache access latency. You can find the instantiation of the crossbar here: Link to ESP's ariane_wrap file
Further, we observed that the crossbar is not actually necessary because the second of its two slave ports is never used. Thus, we could replace the crossbar with a demultiplexer. We found the demux here: Link to the AXI demux
Questions:
- Is there a README or a reference implementation where this demux is instantiated? We would like to know how to connect the interfaces and how to pass the address input as the select signal.
- What is the cycle latency between the slave and master ports? This information will help us understand the benefit over the current crossbar implementation.
Any other suggestions or feedback would be great!
If I'm not completely mistaken, the file you link currently instantiates the following module from a separate repository: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv
The axi_xbar module should automatically use the axi_demux module when you declare only one slave port. To configure the latency, set the LatencyMode field in the Cfg parameter. For more detail on how to use the modules, please refer to the doc folder, which has detailed descriptions of the implementation of both axi_xbar and axi_demux. For an implementation example, since you are using Ariane, I suggest having a look at the master branch of CVA6: https://github.com/openhwgroup/cva6/blob/75807530f26ba9a0ca501e9d3a6575ec375ed7ab/corev_apu/tb/ariane_testharness.sv#L477-L524.
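As a rough illustration of the Cfg parameter mentioned above, a configuration for a single slave port might look like the fragment below. This is only a sketch: the field list is recalled from axi_pkg::xbar_cfg_t as used around the linked CVA6 testharness and may differ in your release of the axi package, and every numeric value as well as the name XbarCfgSketch is a hypothetical placeholder. Please check it against the axi_pkg and the doc folder of the version you actually use.

```systemverilog
// Sketch only, not taken verbatim from any repository.
// Field names are assumed to match axi_pkg::xbar_cfg_t in the axi release
// used by the linked CVA6 testharness; other releases may add, drop, or
// rename fields. All numeric values are hypothetical placeholders.
localparam axi_pkg::xbar_cfg_t XbarCfgSketch = '{
  NoSlvPorts:         1,                   // one upstream port (the core side)
  NoMstPorts:         2,                   // hypothetical number of downstream ports
  MaxMstTrans:        4,                   // hypothetical outstanding-transaction limits
  MaxSlvTrans:        4,
  FallThrough:        1'b0,
  LatencyMode:        axi_pkg::NO_LATENCY, // no pipeline cuts inside the crossbar
  AxiIdWidthSlvPorts: 4,                   // hypothetical ID width
  AxiIdUsedSlvPorts:  4,
  UniqueIds:          1'b0,
  AxiAddrWidth:       64,
  AxiDataWidth:       64,
  NoAddrRules:        2                    // one address rule per downstream port here
};
```

The other LatencyMode values insert registers on selected channels, trading added cycles for shorter combinational paths, so the choice depends on the timing margin of the surrounding design.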
Thanks for the response! The reference implementation is very helpful! We will try to replace our current xbar with the xbar that you have pointed to — the interface seems to be similar. We have one follow-up question:
- When we configure LatencyMode as NO_LATENCY, does that mean that the output is visible on the master port in the cycle immediately after it arrives at the slave port? Or does it mean that the design adds no additional latency cycles on top of a default latency (say, 2 or 3 cycles)?
Should be combinational, i.e. no cycle latency at all, but please check the documentation for how to use the LatencyMode parameter.
Thanks for the help! Upon further debugging, we found that the additional cycles were not in fact from the crossbar in our case. There were two-stage synchronization modules before and after the crossbar that were adding 4 cycles in total, as seen here: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv#L78-L106
We tried removing these entirely, but that caused bitstream synthesis to fail. We also tried reducing the number of stages to one (we don't really have a CDC at the crossbar). This time the synthesis worked but we're seeing instability while running Linux.
What is the purpose of these modules? Is there a more optimal implementation that reduces the delay?
> Thanks for the help! Upon further debugging, we found that the additional cycles were not in fact from the crossbar in our case. There were two-stage synchronization modules before and after the crossbar that were adding 4 cycles in total, as seen here: https://github.com/pulp-platform/axi_node/blob/a29a69a543e96d0c9f79ea9c7df20580b3da5002/src/axi_node_wrap_with_slices.sv#L78-L106
Those modules are neither CDCs nor synchronizers; they are pipeline registers.
> We tried removing these entirely, but that caused bitstream synthesis to fail. We also tried reducing the number of stages to one (we don't really have a CDC at the crossbar). This time the synthesis worked but we're seeing instability while running Linux.
Are you still using axi_node_wrap_with_slices? If so, please switch to axi_xbar as explained by @micprog above. axi_node was deprecated several years ago.
If you are already using the modules in this repository and have questions, please be more specific. We cannot debug "bitstream synthesis fails" or "instability while running Linux" in an on-chip network repository.
> What is the purpose of these modules? Is there a more optimal implementation that reduces the delay?
Pipeline registers reduce the longest combinational path. The optimal configuration depends on the surrounding circuit.
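To make the trade-off concrete, here is a minimal sketch of a single pipeline stage on a generic valid/ready channel. It is not the slice used in axi_node or in this repository; the module name pipe_reg_sketch and all port names are hypothetical. It registers valid and data, so the forward combinational path is cut at the cost of one clock cycle, while the ready path still passes through combinationally.

```systemverilog
// Generic one-stage pipeline register for a valid/ready handshake.
// Illustrative only; module and port names are hypothetical.
module pipe_reg_sketch #(
  parameter int unsigned Width = 32
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  // upstream side
  input  logic             valid_i,
  output logic             ready_o,
  input  logic [Width-1:0] data_i,
  // downstream side
  output logic             valid_o,
  input  logic             ready_i,
  output logic [Width-1:0] data_o
);
  // Accept a new beat when the register is empty or is drained this cycle.
  assign ready_o = ~valid_o | ready_i;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      valid_o <= 1'b0;
      data_o  <= '0;
    end else if (ready_o) begin
      // Either load the next beat or mark the register as empty.
      valid_o <= valid_i;
      if (valid_i) data_o <= data_i;
    end
  end
endmodule
```

Each such stage that a channel passes through costs one clock cycle, which is why removing stages lowers latency but lengthens the combinational path between the remaining registers.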