Replace FMA's LZC with CVW's LZA

Open emustafa96 opened this issue 7 months ago • 3 comments

Replace leading zero counter with leading zero anticipator in FMA sum path

Summary

This PR optimizes the floating-point multiply-add (FMA) unit by replacing the leading zero counter (LZC) in the sum path, which can only start once the sum is available, with a leading zero anticipator (LZA) that runs in parallel with the adder. This takes the leading-zero count off the critical path between the adder and the normalization shifter, improving FMA performance.

Problem

The previous implementation computed the sum first, then counted leading zeros for normalization:

Multiply → Align → Add/Subtract → Count Leading Zeros → Normalize → Result
                                      ↑
                               Critical path bottleneck

This sequential approach added unnecessary latency to the FMA operation, as normalization had to wait for the complete sum calculation.
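As a rough behavioral illustration of that dependency (a Python sketch, not the RTL; the function names and the fixed `width` parameter are assumptions made for this example), the shift amount below cannot be computed until the sum exists:

```python
def count_leading_zeros(value: int, width: int) -> int:
    """Scan a width-bit vector from the MSB down and count zeros before the first one."""
    count = 0
    for bit in range(width - 1, -1, -1):
        if (value >> bit) & 1:
            break
        count += 1
    return count


def normalize_after_sum(a: int, b: int, width: int) -> tuple[int, int]:
    """Baseline data flow: add, then count leading zeros, then shift.

    Each step consumes the result of the previous one, so the count sits
    directly on the path between the adder and the shifter.
    """
    mask = (1 << width) - 1
    total = (a + b) & mask                      # step 1: the sum must finish first
    shift = count_leading_zeros(total, width)   # step 2: needs the sum
    normalized = (total << shift) & mask        # step 3: needs the count
    return normalized, shift
```

For example, `normalize_after_sum(3, 2, 8)` first forms the sum 5 (`00000101`), then counts 5 leading zeros, and only then shifts.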

Solution

Added Schmookler's leading zero anticipation algorithm (IEEEX), as implemented in the Wally core (CVW), which predicts the normalization shift count in parallel with the sum computation:

Multiply → Align → Add/Subtract ──────────→ Normalize → Result
           ↓                               ↗
           └── Leading Zero Anticipator ──┘
           (in parallel)

Technical Details

The LZA implementation:

  • Uses carry-lookahead logic (P/G/K signals) to predict leading zero patterns
  • Handles both addition and subtraction operations via the sub control signal
  • Detects and corrects mispredictions by one bit position
  • Feeds the predicted shift count directly to the normalization stage
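The points above can be sketched as a small behavioral model. This is a Python illustration of a Schmookler-style indicator, not the code in fmalza.sv: the function names are hypothetical, and the model covers only effective subtraction with a positive result (x > y), where the subtrahend is inverted and the carry-in is 1. Under that assumption the predicted count is either exact or one too small, which the final check corrects.

```python
def leading_zeros(value: int, width: int) -> int:
    """Leading zeros of value viewed as a width-bit vector."""
    return width - value.bit_length()


def lza_subtract(x: int, y: int, width: int) -> int:
    """Predict the leading-zero count of x - y from the operands alone.

    Assumes 0 <= y < x < 2**width. The difference itself is never computed.
    """
    mask = (1 << width) - 1
    a = x & mask
    b = ~y & mask              # one's complement of the subtrahend
    cin = 1                    # carry-in completes the two's complement

    p = a ^ b                  # propagate
    g = a & b                  # generate
    k = ~(a | b) & mask        # kill

    p_up = (p >> 1) | (1 << (width - 1))    # P[i+1]; MSB filled with sub = 1
    g_dn = ((g << 1) | cin) & mask          # G[i-1]; cin enters at bit 0
    k_dn = ((k << 1) | (cin ^ 1)) & mask    # K[i-1]; ~cin enters at bit 0

    # Indicator vector: its leading one marks the predicted leading-one position.
    f = (p_up & ((g & ~k_dn) | (k & ~g_dn))) | (~p_up & ((k & ~k_dn) | (g & ~g_dn)))
    return leading_zeros(f & mask, width)


def normalize_difference(x: int, y: int, width: int) -> tuple[int, int]:
    """Shift x - y by the predicted count, then fix a misprediction by one."""
    mask = (1 << width) - 1
    shift = lza_subtract(x, y, width)       # can be evaluated alongside the subtraction
    shifted = ((x - y) << shift) & mask
    if not (shifted >> (width - 1)) & 1:    # MSB still 0: prediction was one short
        shifted = (shifted << 1) & mask
        shift += 1
    return shifted, shift
```

With 4-bit operands, `lza_subtract(8, 7, 4)` predicts 3 leading zeros, which is exact for the difference 1, while `lza_subtract(8, 1, 4)` predicts 0 for the difference 7 (`0111`), so the correction step adds one more shift.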

Testing

  • Verified with Synopsys VC Formal's sequential equivalence check
  • Proven equivalent to the previous implementation
  Summary Proofs:
   ----------------------------------------------------------------------------------------------------------------------
    VpId |           Name |      Type |         Parent |     #A |     #C |     #S |     #F |     #I |    Status |     %
   ----------------------------------------------------------------------------------------------------------------------
       0 |         seqdef |      root |            nil |     13 |      3 |     13 |      0 |      0 |   success |   100
       0 |      seqdef-rw |        or |         seqdef |      - |      - |      - |      - |      - |         - |     -
       0 |          rw1_1 |       int |      seqdef-rw |      5 |      0 |      5 |      0 |      0 |   success |   100
       0 |       rw1_1-ur |        or |          rw1_1 |      - |      - |      - |      - |      - |         - |     -
       0 |           ur_1 |      leaf |       rw1_1-ur |      4 |      0 |      4 |      0 |      0 |   success |   100
       0 |      rw1_1-dcp | decompose |          rw1_1 |      - |      - |      - |      - |      - |         - |     -
       0 |         idcp_1 |      leaf |      rw1_1-dcp |      4 |      0 |      4 |      0 |      0 |   success |   100
   ----------------------------------------------------------------------------------------------------------------------

emustafa96 • Apr 29 '25 11:04

The following script can be used to verify, with Synopsys VC Formal's sequential equivalence check, that the proposed changes are sequentially equivalent to the current implementation (vcf -file script_below.tcl):

set_fml_appmode SEQ

set SCRIPT_DIR [file normalize [file join [file dirname [info script]] ]]

set flist_golden [list \
 "common_cells/src/cf_math_pkg.sv" \
 "common_cells/src/lzc.sv" \
 "cvfpu/src/fpnew_pkg.sv" \
 "cvfpu/src/fpnew_classifier.sv" \
 "cvfpu/src/fpnew_rounding.sv" \
 "cvfpu/src/fpnew_fma_multi.sv" \
]
set flist_impl [list \
 "common_cells/src/cf_math_pkg.sv" \
 "common_cells/src/lzc.sv" \
 "cvfpu/src/fpnew_pkg.sv" \
 "cvfpu/src/fpnew_classifier.sv" \
 "cvfpu/src/fpnew_rounding.sv" \
 "cvfpu/src/fpnew_fma_multi_new.sv" \
 "cvfpu/vendor/cvw/fma/fmalza.sv" \
]

analyze -format sverilog -library spec -vcs $flist_golden +incdir+common_cells/include
analyze -format sverilog -library impl -vcs $flist_impl +incdir+common_cells/include


elaborate_seq -spectop fpnew_fma_multi -impltop fpnew_fma_multi

map_by_name -clock spec.clk_i

create_clock -period 100 spec.clk_i
create_reset spec.rst_ni -sense low

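# Pin the source and destination formats to encoding 0 to keep the proof tractable;
# change the constants to try other formats.
fvassume -expr {spec.src_fmt_i == 0}
fvassume -expr {spec.src2_fmt_i == 0}
fvassume -expr {spec.dst_fmt_i == 0}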

sim_run -stable
sim_set_state -uninitialized -apply 0

check_fv -block

report_proofs

Make sure the paths to cvfpu and common_cells are correct relative to the directory where vcf is called.

cvfpu/src/fpnew_fma_multi_new.sv contains the changes of this patch, while cvfpu/src/fpnew_fma_multi.sv and all other source files are the current version from develop. Different source and destination formats can be tried manually (unfortunately, runtime explodes when attempting to constrain these more loosely via, e.g., spec.src_fmt_i inside {0,1,2,3,4}).

emustafa96 • Jun 25 '25 12:06

Hi @emustafa96. I tested the PR using the UVM testbench https://github.com/openhwgroup/cvfpu-uvm.git. I configured the FPU instance to use the merged-slice implementation for the FMA unit, so that the ADD/MUL operations stress your modifications. As a regression test, I ran 10000 random transactions with random operation, operands, FP format, and FP rounding mode, repeated for 10 different seeds, and compared the results with those given by the MPFR golden model. Everything looks fine, so if you agree with my test and results, I think the PR can be merged.

rgiunti • Oct 28 '25 16:10

Hi @rgiunti, thank you for the effort! Based on the formal equivalence check and your testing, I also think we can merge.

emustafa96 • Oct 28 '25 16:10