synth_opt_adders icon indicating copy to clipboard operation
synth_opt_adders copied to clipboard

Optimization for large_number + small_number type sums?

Open opensource-fr opened this issue 2 years ago • 2 comments

Seeing that many of the operations I'm encountering are addition of a small bit-width number to a much larger-bit-width number (e.g. to accumulating a new 8-bit width value into a large 20-bit width value each clock cycle).

Knowing that for the example of 8bit + 20 bit number, that the 9th-20th bit of the "8-bit-width" number are all zeroes, seems we might be able to further optimize trees for this type of scenario.

Hoping to start the discussion, curious what you would suggest as a good starting point for this exploration?

opensource-fr avatar Jan 01 '23 09:01 opensource-fr

In my mind, this builds off of the topic of adding a constant value. The section I've linked there is old and badly-written, apologies.

Specifically, in your case of adding an 8-bit number and a 20-bit number:

  • Build the adder as usual, starting with the normal nodes of (G, P) ■ (G', P') = (G + PG', PP')
  • Any nodes that only handle bits 9-20 can be simplified to ( , P) ■ ( , P') = ( , PP'). As in, you cut out both the AOI and half of the pre-processing.
  • Any nodes that combine with a simplified node simplify to ( , P) ■ (G', P') = (PG', PP'). As in, the AOI turns into a NAND.

The only design-space consideration here is that while the delay of ( , P) ■ [ (G', P') ■ (G'', P'') ] is AOI + NAND, the delay of [ ( , P) ■ (G', P') ] ■ (G'', P'') is AOI + AOI. The design would want to account for this.
The simplest heuristic would thus be to build a specific structure for bits 0 -> 8 and only connect it to the higher bits on the final step. For example, specifically constructing c9 = p8[c8], rather than risking something like c9 = [ p8(g7 + p7g6) ] + [ p8 p7p6c6 ].

That strategy should be close to optimal. Things get more complicated when you account for factorization, but that gets complicated, and has coincidentally been "temporarily" removed from this library.

tdene avatar Jan 01 '23 14:01 tdene

So in short, the following method:

  • Build a structure for the lower bits.
  • Only connect it in into the higher bits on the very last step.

should be close to optimal, and can be expected to save about ~60% area and ~40% delay on the higher bits [compared to a full-blown adder].

tdene avatar Jan 01 '23 14:01 tdene