
Matmul module, skip OOO execution

Open evanmays opened this issue 3 years ago • 4 comments

Using Vivado, this appears to fit on the mini FPGA for 4x4 matrices (using one DSP per multiply and 45% of the LUTs). I left out the floating point multiply and add modules since their licensing was unclear, but they can be sourced from here: https://people.ece.cornell.edu/land/courses/ece5760/DE1_SOC/HPS_peripherials/Floating_Point_index.html

I'm told a fused multiply-add can save some LUTs, as can accumulating through a cascade instead of an adder tree.

Their implementation is a 27-bit float, not 18, so it's possible we can fit an 8x8 matrix multiply by actually using 18-bit cherryfloats.

Two quotes from that page serve as evidence that we can do a full matrix multiply-and-accumulate operation in one cycle. One-cycle instructions mean we only need a 2-stage pipeline, which means we don't need any reorder buffers, superscalar, or VLIW.

"a Cyclone5 DSP unit for multiply and takes one cycle for a floating multiply and two for a floating add"

"one cycle, at 50MHz, seems to work for the combinational adder. Actually, you can chain up to three additions in one cycle of 50MHz. The FPGA resource used for one floating point combinatorial adder is about 550 ALMs."

evanmays · Sep 04 '21 20:09

After finding Vivado's implementation & timing buttons, it turns out I was wrong about executing in one cycle.

Still, I was able to get the math working for a mathematically guaranteed 1-instruction-per-cycle pipeline (if you set up 9 operand forwards in key locations). Then I benchmarked and realized we were hardly using our MXU. Now I realize we need OOO so that we can maximize usage of our different functional units.

I made a tiny 100-line VLIW compiler. Each instruction has 4 slots: two of them can be anything except loads, and the other two must be loads. Example kernels can be viewed below!
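For concreteness, here's a minimal Python sketch of what that 4-slot bundle format could look like. The op names, operand encoding, and slot layout are illustrative assumptions, not the actual cherry ISA.

```python
# Hypothetical sketch of a 4-slot VLIW bundle: two compute slots (anything
# except loads) and two load slots. Names and ops are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class Op(Enum):
    LOAD = auto()
    STORE = auto()
    MATMUL = auto()
    MULACC = auto()
    UNOP = auto()
    BINOP = auto()
    REDUCE = auto()

@dataclass
class Instr:
    op: Op
    args: Tuple[int, ...] = ()

@dataclass
class Bundle:
    compute0: Optional[Instr] = None   # any op except LOAD
    compute1: Optional[Instr] = None   # any op except LOAD
    load0: Optional[Instr] = None      # must be a LOAD (or empty)
    load1: Optional[Instr] = None      # must be a LOAD (or empty)

    def validate(self):
        for slot in (self.compute0, self.compute1):
            assert slot is None or slot.op != Op.LOAD, "compute slots can't hold loads"
        for slot in (self.load0, self.load1):
            assert slot is None or slot.op == Op.LOAD, "load slots only hold loads"

# Example bundle: one matmul, one binop, and two loads issued together.
b = Bundle(Instr(Op.MATMUL, (0, 1, 2)), Instr(Op.BINOP, (2, 3)),
           Instr(Op.LOAD, (4,)), Instr(Op.LOAD, (5,)))
b.validate()
```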

From my local tinygrad fork, here are some numbers we can expect with this VLIW format and a really bad VLIW compiler.

All numbers are estimated for the cherry two tape-out (32x32x32 matmul at 1GHz). Estimating for 4x4x4 takes too long.

Backward pass disabled

| Model | TFLOPs | MXU Utilization | Total Cycles | Command | Mega Kernel |
| --- | --- | --- | --- | --- | --- |
| EfficientNet | 9 | 12% | 672,785 | `PYTHONPATH="." DEBUG=1 CHERRY=1 python3 test/test_train.py TestTrain.test_efficientnet` | Github gist |
| Transformer | 53 | 82% | 53,705 | `PYTHONPATH="." DEBUG=1 CHERRY=1 python3 test/test_train.py TestTrain.test_transformer` | Github gist |
| Resnet | 50 | 77% | 1,338,461 | `PYTHONPATH="." DEBUG=1 CHERRY=1 python3 test/test_train.py TestTrain.test_resnet` | Github gist |

Note: I'm only running resnet50
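As a rough sanity check on the table (this is my own assumption about how the numbers relate: peak throughput is one 32x32x32 multiply-accumulate per cycle at 1GHz, counting a multiply-add as 2 FLOPs):

```python
# Assumed peak: one 32x32x32 matmul-accumulate retires per cycle at 1 GHz,
# with each multiply-add counted as 2 FLOPs -> ~65.5 TFLOPs peak.
PEAK_FLOPS = 2 * 32 * 32 * 32 * 1e9
for model, util in [("Transformer", 0.82), ("Resnet", 0.77), ("EfficientNet", 0.12)]:
    print(f"{model}: ~{PEAK_FLOPS * util / 1e12:.0f} TFLOPs at {util:.0%} MXU utilization")
# Transformer: ~54 TFLOPs, Resnet: ~50 TFLOPs -- roughly matching the table above.
```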

evanmays · Sep 08 '21 03:09

Awesome stuff! I haven't had much time to work on this, though I've come around somewhat on VLIW and systolic arrays. Efficiency has a lot to do with data movement costs, I believe.

geohot · Sep 08 '21 20:09

I find it interesting that all VLIW implementations require the compiler to optimize the memory hierarchy and functional unit utilization.

Optimizing memory accesses can be done locally. After hashing all the addresses, a reorder buffer of size 10 can reorder memory instructions to get zero bank conflicts.

I've added onto my previous 100-line compiler and written some new stuff that does this.
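Roughly what I mean, as a sketch: the bank count, window size, and address-to-bank hash below are made up for illustration, and dependencies between memory ops are ignored.

```python
# Illustrative sketch: reorder memory ops within a small window so ops issued
# together never hit the same SRAM bank. Bank count, window size, and the
# address->bank hash are toy values; real dependencies are ignored here.
NUM_BANKS = 16
WINDOW = 10  # reorder window of size 10, as described above

def bank(addr: int) -> int:
    # toy "hash": low bits of the address pick the bank
    return addr % NUM_BANKS

def schedule(mem_ops):
    """mem_ops: list of addresses in program order.
    Greedily issue groups of ops with pairwise-distinct banks,
    looking at most WINDOW ops ahead."""
    pending = list(mem_ops)
    issued_groups = []
    while pending:
        group, used_banks = [], set()
        i = 0
        while i < len(pending) and i < WINDOW:
            b = bank(pending[i])
            if b not in used_banks:
                used_banks.add(b)
                group.append(pending.pop(i))
            else:
                i += 1
        issued_groups.append(group)
    return issued_groups

# 0, 16, 32 all map to bank 0, so they get spread across groups.
print(schedule([0, 16, 32, 1, 17, 2]))  # [[0, 1, 2], [16, 17], [32]]
```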

Meanwhile, functional unit utilization for AI needs a multi-thousand-entry reorder buffer. There are a thousand matmuls followed by a thousand unops followed by a thousand binops. Even the M1's reorder buffer can't see enough of what's happening to optimize; it would only see one type of instruction at a time.

Even with a large ROB, you'd get 80% of your gains from optimizing ~20% of the common sequences.

For example: binop_add(matmul(A, B), C)

Optimizing this, along with maybe 10 other common operation permutations, would probably capture 80% of the gains a compiler or superscalar could get.

Maybe just let humans optimize these by hand.
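For concreteness, a hand-written rule for the binop_add(matmul(A, B), C) case could look something like this sketch, assuming the mulacc unit can take the accumulate operand. The node representation and op names are hypothetical, not the actual cherry compiler.

```python
# Sketch of one hand-written fusion rule: binop_add(matmul(A, B), C) is
# rewritten to a single mulacc op, so the add rides along with the matmul
# instead of occupying the binop ALU. Node/op names are illustrative.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Node:
    op: str                           # e.g. "matmul", "binop_add", "mulacc"
    srcs: Tuple["Node", ...] = ()

def fuse_matmul_add(node: Node) -> Node:
    # Recurse first so nested patterns get fused too.
    node = Node(node.op, tuple(fuse_matmul_add(s) for s in node.srcs))
    if node.op == "binop_add" and len(node.srcs) == 2 and node.srcs[0].op == "matmul":
        a, b = node.srcs[0].srcs
        c = node.srcs[1]
        return Node("mulacc", (a, b, c))  # one instruction on the mulacc unit
    return node

A, B, C = Node("input"), Node("input"), Node("input")
fused = fuse_matmul_add(Node("binop_add", (Node("matmul", (A, B)), C)))
print(fused.op)  # mulacc
```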

evanmays · Sep 19 '21 18:09

With 5 computational functional units (unopALU, binopALU, reduceALU, matmul, mulacc), a newbie writing the VLIW instructions in Python wouldn't be too different from the existing cherry.py code in tinygrad. The compiler doesn't have to worry about parallelizing computation; it just worries about parallelizing memory accesses to avoid bank conflicts, which makes it a simple compiler with one optimization stage.

I'm not sure how/if this scales to multiple cores. I guess the compiler would need to be revisited then, or the tensor ops would need to be rewritten to explicitly utilize the cores.

evanmays · Sep 19 '21 21:09