calyx icon indicating copy to clipboard operation
calyx copied to clipboard

AXI returns unexpected 0s

Open zzy666666zzy opened this issue 2 years ago • 4 comments

Hi,

Following the previous issue #1470, we have managed to run a 4x4 GEMM on a Standalone (bared-metal) platform.

When we tried bigger matrix sizes (8x8, 16x16), I printed the output matrix from the ARM sides, the IP core only gave us 16 correct elements (2 rows of correct elements for 8x8, 1 row of correct elements for 16x16). The rest elements are all zeros: 1687208165571 The correct output should be 1687208241004

When I delve into the generated Toplevel AXI verilog code, I found some clues.

In our case, 'm1' outputs the resultant matrix. We go to the module Memory_controller_axi_1, and look at this snippet

reg [6:0] send_addr_offset;
...
assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset * 32;

bram_read_data is the result from the core GEMM logic (ext_mem0__write_data) and always produces correct results.

BUT, when we try bigger matrix sizes and send_addr_offset equals is bigger than 16, WDATA is always zero because it left shifts (>16)x32 bit which is over 512-b. Previously, we could get the correct result from 4x4 because we luckily saturated the 512-b with 16 resultant elements.

I am sure the core logic without the AXI wrapper can output correct results, we have tested 4x4 to 64x64. We have changed discover-external:default=xxx according to our matrix sizes. From the generated AXI verilog code we can tell that discover-external:default=xxx affects send_addr_offset and thereby affects the number of shifts. I attached the MLIR and the AXI code here for convenient reference. gemm8x8.mlir.txt gemm8x8_wrapper.sv.txt

Also the doubled under-score issue in main kernel_inst still exists, signal inconsistency leads to synthesis failure.

Appreciate it if there is any solution for returning 0s and the doubled underscore issue.

Many thanks.

zzy666666zzy avatar Jun 19 '23 21:06 zzy666666zzy

This issue can be addressed by adding the following Verilog to the destination master port in AXI top-level file

reg [3:0] shift_var;
always @(posedge ACLK) begin
if(memory_mode_state == 3) begin
	if(BVALID & BREADY) shift_var <= shift_var + 'd1;
	//else if (BVALID & BREADY & shift_var=='d15) shift_var<='d0;
	else shift_var <= shift_var;
end 
else shift_var <= 0;
end

And replace send_addr_offset with shift_var in

    assign WDATA = {{15{32'b0}}, bram_read_data} << send_addr_offset* 32;
    assign WSTRB = {{15{4'h0}}, 4'hF} << send_addr_offset* 4;

Might this bit modification can be realized in Calyx source code.

zzy666666zzy avatar Jun 26 '23 13:06 zzy666666zzy

Reopening this! @zzy666666zzy we were all traveling last week so can look into this a bit more now that we’re back.

rachitnigam avatar Jun 26 '23 13:06 rachitnigam

Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

zzy666666zzy avatar Jun 27 '23 13:06 zzy666666zzy

Hi @rachitnigam circt_calyx4x4.sh.txt gemm4x4.mlir.txt gemm8x8.mlir.txt

Just realized another issue. In my matrix multiplications example (please see the MLIR and my script attached), arg0 is the output argument (output ext_mem0), arg1 and arg2 are input matrix (ext_mem1_* and 'ext_mem2'_* in Verilog file). But in the generated AXI file, the order of these three ports are randomly swapped when I increase the matrix size. For example: inside Control_axi module, of 4x4 GEMM, the order of the three ports is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end

But for 16x16 the order is:

                12'h18 : begin
                    rdata[31:0] <= addr_ext_mem1_[31:0];
                end
                12'h1c : begin
                    rdata[31:0] <= addr_ext_mem1_[63:32];
                end
                12'h20 : begin
                    rdata[31:0] <= addr_ext_mem0_[31:0];
                end
                12'h24 : begin
                    rdata[31:0] <= addr_ext_mem0_[63:32];
                end
                12'h28 : begin
                    rdata[31:0] <= addr_ext_mem2_[31:0];
                end
                12'h2c : begin
                    rdata[31:0] <= addr_ext_mem2_[63:32];
                end

The sequence of these ports is important because for standalone(baremetal) platform we need to configure the base address of inputs and output ports manually through s_axi_control.

Is there any approach to avoid the shuffled sequence of addr_ext_mem*_?

Thanks a lot.

I updated the calyx repo, the shuffled sequence of ext_mem* seems being solved.

zzy666666zzy avatar Jun 27 '23 14:06 zzy666666zzy