Replace FMA's LZC with CVW's LZA
Replace leading zero counter with leading zero anticipator in FMA sum path
Summary
This PR optimizes the floating-point multiply-add (FMA) unit by replacing the leading zero counter (LZC) in the sum path, which can only start once the sum is available, with a leading zero anticipator (LZA) that runs in parallel with the adder. This takes the leading-zero count off the critical path and shortens the FMA's overall delay.
Problem
The previous implementation computed the sum first, then counted leading zeros for normalization:
Multiply → Align → Add/Subtract → Count Leading Zeros → Normalize → Result
                                           ↑
                                Critical path bottleneck
This sequential approach added unnecessary latency to the FMA operation, as normalization had to wait for the complete sum calculation.
Solution
Added Schmookler's leading zero anticipation algorithm (IEEEX), as implemented in the Wally core (CVW), which predicts the normalization shift count in parallel with the sum computation:
Multiply → Align → Add/Subtract ──────────→ Normalize → Result
               ↓                          ↗
               └── Leading Zero Anticipator ──┘
                        (in parallel)
Technical Details
The LZA implementation:
- Uses carry-lookahead logic (P/G/K signals) to predict leading zero patterns
- Handles both addition and subtraction operations via the sub control signal
- Adds logic to detect and correct mispredictions by one position
- Feeds the predicted shift count directly to the normalization stage
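The points above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the RTL in fmalza.sv: the function names (count_leading_zeros, lza_predict, normalize) are hypothetical, operands are assumed to be non-negative with one spare sign bit, and the result is assumed to be positive. Under these assumptions the prediction is either exact or one too small, which is what the correction step relies on.

```python
def count_leading_zeros(value: int, width: int) -> int:
    """Exact leading zero count of a width-bit value (what an LZC computes)."""
    return width - value.bit_length()


def lza_predict(a: int, b: int, cin: int, width: int) -> int:
    """Anticipate the leading zero count of (a + b + cin) without adding.

    Assumption: a and b are width-bit patterns whose top bit is a sign bit,
    and the result is positive. For a subtraction x - y, pass b = ~y, cin = 1.
    """
    mask = (1 << width) - 1
    p = (a ^ b) & mask      # propagate
    g = a & b & mask        # generate
    k = ~(a | b) & mask     # kill

    # Neighbour views: bit i of p_up is p[i+1]; bit i of g_dn / k_dn is g[i-1] / k[i-1].
    p_up = (p >> 1) | (p & (1 << (width - 1)))  # replicate the top bit
    g_dn = ((g << 1) | cin) & mask              # carry-in behaves like a generate below bit 0
    k_dn = ((k << 1) | (cin ^ 1)) & mask

    # Indicator vector: its most significant 1 marks the anticipated leading one.
    f = ((p_up & ((g & ~k_dn) | (k & ~g_dn)))
         | (~p_up & ((k & ~k_dn) | (g & ~g_dn)))) & mask
    return width - f.bit_length()


def normalize(total: int, predicted: int, width: int) -> tuple:
    """Shift by the predicted count, then fix a misprediction by one."""
    mask = (1 << width) - 1
    shifted = (total << predicted) & mask
    if total and not shifted >> (width - 1):  # MSB still 0: prediction was one too small
        shifted = (shifted << 1) & mask
        predicted += 1
    return shifted, predicted


WIDTH = 6
MASK = (1 << WIDTH) - 1
A, B = 0b010110, 0b010011                    # 22 - 19 = 3
total = (A + (~B & MASK) + 1) & MASK         # two's complement subtraction
predicted = lza_predict(A, ~B & MASK, 1, WIDTH)
print(predicted, count_leading_zeros(total, WIDTH))  # prints "3 4"
print(normalize(total, predicted, WIDTH))            # prints "(48, 4)"
```

In this example the anticipator undercounts by one (3 instead of 4), and the single conditional shift in normalize recovers the exact result without ever running a leading zero count on the sum itself.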
Testing
- Verified with Synopsys VC Formal's sequential equivalence check
- Proven to be equivalent
Summary Proofs:
----------------------------------------------------------------------------------------------------------------------
VpId | Name | Type | Parent | #A | #C | #S | #F | #I | Status | %
----------------------------------------------------------------------------------------------------------------------
0 | seqdef | root | nil | 13 | 3 | 13 | 0 | 0 | success | 100
0 | seqdef-rw | or | seqdef | - | - | - | - | - | - | -
0 | rw1_1 | int | seqdef-rw | 5 | 0 | 5 | 0 | 0 | success | 100
0 | rw1_1-ur | or | rw1_1 | - | - | - | - | - | - | -
0 | ur_1 | leaf | rw1_1-ur | 4 | 0 | 4 | 0 | 0 | success | 100
0 | rw1_1-dcp | decompose | rw1_1 | - | - | - | - | - | - | -
0 | idcp_1 | leaf | rw1_1-dcp | 4 | 0 | 4 | 0 | 0 | success | 100
----------------------------------------------------------------------------------------------------------------------
The following script can be used to verify that the proposed changes are sequentially equivalent to the current implementation, using Synopsys VC Formal's sequential equivalence check (vcf -file script_below.tcl):
set_fml_appmode SEQ
set SCRIPT_DIR [file normalize [file join [file dirname [info script]] ]]
set flist_golden [list \
"common_cells/src/cf_math_pkg.sv" \
"common_cells/src/lzc.sv" \
"cvfpu/src/fpnew_pkg.sv" \
"cvfpu/src/fpnew_classifier.sv" \
"cvfpu/src/fpnew_rounding.sv" \
"cvfpu/src/fpnew_fma_multi.sv" \
]
set flist_impl [list \
"common_cells/src/cf_math_pkg.sv" \
"common_cells/src/lzc.sv" \
"cvfpu/src/fpnew_pkg.sv" \
"cvfpu/src/fpnew_classifier.sv" \
"cvfpu/src/fpnew_rounding.sv" \
"cvfpu/src/fpnew_fma_multi_new.sv" \
"cvfpu/vendor/cvw/fma/fmalza.sv" \
]
analyze -format sverilog -library spec -vcs $flist_golden +incdir+common_cells/include
analyze -format sverilog -library impl -vcs $flist_impl +incdir+common_cells/include
elaborate_seq -spectop fpnew_fma_multi -impltop fpnew_fma_multi
map_by_name -clock spec.clk_i
create_clock -period 100 spec.clk_i
create_reset spec.rst_ni -sense low
# Fix all source and destination formats to encoding 0 to keep the runtime tractable
fvassume -expr {spec.src_fmt_i == 0}
fvassume -expr {spec.src2_fmt_i == 0}
fvassume -expr {spec.dst_fmt_i == 0}
sim_run -stable
sim_set_state -uninitialized -apply 0
check_fv -block
report_proofs
Make sure the paths to cvfpu and common_cells are correct relative to where vcf is called.
cvfpu/src/fpnew_fma_multi_new.sv contains the changes of this patch, while cvfpu/src/fpnew_fma_multi.sv and all other source files are the current version from develop. Different source and destination formats can be tried manually (unfortunately, runtime explodes when attempting to constrain these more loosely via, e.g., spec.src_fmt_i inside {0,1,2,3,4}).
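To try other format combinations one at a time, the three fvassume lines in the script can be regenerated per run. The helper below is a hypothetical convenience sketch (format_assumptions is not part of the PR or of VC Formal); it only emits the same constraint lines as the script above for a chosen set of format encodings.

```python
def format_assumptions(src: int, src2: int, dst: int) -> list:
    """Return the three fvassume lines for one fixed format combination."""
    values = {"src_fmt_i": src, "src2_fmt_i": src2, "dst_fmt_i": dst}
    return [f"fvassume -expr {{spec.{sig} == {val}}}" for sig, val in values.items()]


# Example: one constraint set per encoding 0..4, with all three formats equal.
for fmt in range(5):
    print(f"# format encoding {fmt}")
    print("\n".join(format_assumptions(fmt, fmt, fmt)))
```

Each printed group can be pasted in place of the fvassume lines for a separate run, which keeps every proof constrained to a single format combination.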
Hi @emustafa96. I tested the PR using the UVM testbench https://github.com/openhwgroup/cvfpu-uvm.git. For the test, I configured the FPU instance to use a merged slice for the FMA unit, so that the ADD and MUL operations exercise your modifications. As a regression test, I ran 10000 random transactions with random operation, operands, FP format, and FP rounding mode, repeated for 10 different seeds, and compared the results with those of the MPFR golden model. Everything looks fine, so if you agree with my test and results, I think the PR can be merged.
Hi @rgiunti, Thank you for the efforts! Concluding from the formal equivalence check and your testing, I also think we can merge.