calyx
calyx copied to clipboard
AXI returns unexpected 0s
Hi,
Following the previous issue #1470, we have managed to run a 4x4 GEMM on a Standalone (bared-metal) platform.
When we tried bigger matrix sizes (8x8, 16x16), I printed the output matrix from the ARM sides, the IP core only gave us 16 correct elements (2 rows of correct elements for 8x8, 1 row of correct elements for 16x16). The rest elements are all zeros:
The correct output should be
When I delve into the generated Toplevel AXI verilog code, I found some clues.
In our case, 'm1' outputs the resultant matrix. We go to the module Memory_controller_axi_1, and look at this snippet
reg [6:0] send_addr_offset;
...
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset * 32;
bram_read_data is the result from the core GEMM logic (ext_mem0__write_data) and always produces correct results.
BUT, when we try bigger matrix sizes and send_addr_offset equals is bigger than 16, WDATA is always zero because it left shifts (>16)x32 bit which is over 512-b. Previously, we could get the correct result from 4x4 because we luckily saturated the 512-b with 16 resultant elements.
I am sure the core logic without the AXI wrapper can output correct results, we have tested 4x4 to 64x64. We have changed discover-external:default=xxx according to our matrix sizes. From the generated AXI verilog code we can tell that discover-external:default=xxx affects send_addr_offset and thereby affects the number of shifts. I attached the MLIR and the AXI code here for convenient reference.
gemm8x8.mlir.txt
gemm8x8_wrapper.sv.txt
Also the doubled under-score issue in main kernel_inst still exists, signal inconsistency leads to synthesis failure.
Appreciate it if there is any solution for returning 0s and the doubled underscore issue.
Many thanks.
This issue can be addressed by adding the following Verilog to the destination master port in AXI top-level file
reg [3:0] shift_var;
always @(posedge ACLK) begin
if(memory_mode_state == 3) begin
if(BVALID & BREADY) shift_var <= shift_var + 'd1;
//else if (BVALID & BREADY & shift_var=='d15) shift_var<='d0;
else shift_var <= shift_var;
end
else shift_var <= 0;
end
And replace send_addr_offset with shift_var in
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset* 32;
assign WSTRB = {{15{4'h0}}, 4'hF} << send_addr_offset* 4;
Might this bit modification can be realized in Calyx source code.
Reopening this! @zzy666666zzy we were all traveling last week so can look into this a bit more now that we’re back.
Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt
Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size.
For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:
12'h18 : begin
rdata[31:0] <= addr_ext_mem0_[31:0];
end
12'h1c : begin
rdata[31:0] <= addr_ext_mem0_[63:32];
end
12'h20 : begin
rdata[31:0] <= addr_ext_mem2_[31:0];
end
12'h24 : begin
rdata[31:0] <= addr_ext_mem2_[63:32];
end
12'h28 : begin
rdata[31:0] <= addr_ext_mem1_[31:0];
end
12'h2c : begin
rdata[31:0] <= addr_ext_mem1_[63:32];
end
But for 16x16 the order is:
12'h18 : begin
rdata[31:0] <= addr_ext_mem1_[31:0];
end
12'h1c : begin
rdata[31:0] <= addr_ext_mem1_[63:32];
end
12'h20 : begin
rdata[31:0] <= addr_ext_mem0_[31:0];
end
12'h24 : begin
rdata[31:0] <= addr_ext_mem0_[63:32];
end
12'h28 : begin
rdata[31:0] <= addr_ext_mem2_[31:0];
end
12'h2c : begin
rdata[31:0] <= addr_ext_mem2_[63:32];
end
The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.
Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?
Thanks a lot.
Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt
Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached),
arg0is the output argument (outputext_mem0),arg1andarg2are input matrix (ext_mem1_*and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: insideControl_aximodule, of 4x4 GEMM, the order of the three ports is:12'h18 : begin rdata[31:0] <= addr_ext_mem0_[31:0]; end 12'h1c : begin rdata[31:0] <= addr_ext_mem0_[63:32]; end 12'h20 : begin rdata[31:0] <= addr_ext_mem2_[31:0]; end 12'h24 : begin rdata[31:0] <= addr_ext_mem2_[63:32]; end 12'h28 : begin rdata[31:0] <= addr_ext_mem1_[31:0]; end 12'h2c : begin rdata[31:0] <= addr_ext_mem1_[63:32]; endBut for 16x16 the order is:
12'h18 : begin rdata[31:0] <= addr_ext_mem1_[31:0]; end 12'h1c : begin rdata[31:0] <= addr_ext_mem1_[63:32]; end 12'h20 : begin rdata[31:0] <= addr_ext_mem0_[31:0]; end 12'h24 : begin rdata[31:0] <= addr_ext_mem0_[63:32]; end 12'h28 : begin rdata[31:0] <= addr_ext_mem2_[31:0]; end 12'h2c : begin rdata[31:0] <= addr_ext_mem2_[63:32]; endThe sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through
s_axi_control.Is there any approach to avoid the shuffled sequence of
addr_ext_mem*_?Thanks a lot.
I updated the calyx repo, the shuffled sequence of ext_mem* seems being solved.