Roelof van Dijk
New faster sum implementation. Change to draft MR.
- Loop over nodes once
- Add flatten property to SumNodes
- Only factoring when necessary
- Simplify factoring code

These functions...
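The approach in the list above can be illustrated with a small standalone sketch. This is not the code from the MR: `Var`, `Mul`, `Sum`, the `flat` property and `fast_sum` are hypothetical stand-ins for the real classes in `symbolic.py`, assuming only that a sum is built from integer constants, variables and variable-times-constant terms.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical, simplified node types (not tinygrad's actual classes).
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    a: Var
    b: int

@dataclass(frozen=True)
class Sum:
    nodes: tuple

    @property
    def flat(self) -> list:
        # Expand nested Sum nodes into one flat list of terms.
        out = []
        for n in self.nodes:
            out.extend(n.flat if isinstance(n, Sum) else [n])
        return out

def fast_sum(nodes):
    const = 0
    terms = []               # (variable, coefficient) pairs in input order
    seen = set()
    needs_factoring = False
    # Single pass over the inputs: fold constants, flatten nested sums,
    # and note whether any variable occurs more than once.
    for node in nodes:
        for n in (node.flat if isinstance(node, Sum) else (node,)):
            if isinstance(n, int):
                const += n
                continue
            var, coeff = (n.a, n.b) if isinstance(n, Mul) else (n, 1)
            needs_factoring = needs_factoring or var in seen
            seen.add(var)
            terms.append((var, coeff))
    # Merge coefficients only when a repeated variable was actually seen.
    if needs_factoring:
        merged = {}
        for var, coeff in terms:
            merged[var] = merged.get(var, 0) + coeff
        terms = [(v, c) for v, c in merged.items() if c != 0]
    out = [v if c == 1 else Mul(v, c) for v, c in terms]
    if const:
        out.append(const)
    if not out:
        return 0
    return out[0] if len(out) == 1 else Sum(tuple(out))
```

For example, `fast_sum([Var("x"), Mul(Var("x"), 2), 3])` takes the factoring path because `x` repeats, while `fast_sum([Var("x"), Var("y")])` skips it entirely.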
sum ``` Total time: 0.266239 s File: /home/rvd/src/roelofvandijk/tinygrad/tinygrad/shape/symbolic.py Function: sum at line 69 Line # Hits Time Per Hit % Time Line Contents ============================================================== 69 @staticmethod 70 @profile 71 def...
Thanks! Should be good to go. There's an efficient specialization for the case n=2, but I can add that later (if 2 different types and no SumNode -> create SumNode)....
It would be very nice to have a dedicated CI/CD test that measures performance on a representative workload, e.g. LLAMA. For now I have been using test_net_speed, test_speed_v_torch and the...
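A performance check of this kind could be built around a small timing helper. The sketch below is an assumption for illustration, not an existing tinygrad test: `bench` is a hypothetical name, and the workload passed to it would be whatever callable builds and runs the model.

```python
import statistics
import time

def bench(fn, warmup=1, runs=10):
    """Time fn() and return (median_ms, per_run_ms)."""
    # Warmup runs are executed but not recorded.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    # The median is less sensitive to a slow outlier run than the mean.
    return statistics.median(times_ms), times_ms

if __name__ == "__main__":
    median_ms, times_ms = bench(lambda: sum(range(100_000)))
    print(f"runtime (median): {median_ms:.2f}ms, runs: "
          + ", ".join(f"{t:.2f}" for t in times_ms))
```

A CI job could compare the reported median against a stored baseline and fail when it regresses beyond a chosen tolerance.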
master (52b7105f8798d88feed0197691391f7b9aa4b013) ``` using testing llama python run time built model assigned empty tensors, doing warmup codegen runtime (median): 211.05ms, runs: 291.58, 210.03, 207.10, 205.42, 213.85, 255.18, 208.70, 205.04, 212.07,...
> This good to merge? Yes
master (52b7105f8798d88feed0197691391f7b9aa4b013) ``` using testing llama python run time built model assigned empty tensors, doing warmup codegen runtime: 234.29ms , runs: 301.52, 212.94, 247.08, 233.68, 218.72, 233.77, 259.32, 211.46, 207.98,...
Perfectly complementary, designed as such. Should give another easy 10% or so. I still feel that there's a layer of abstraction that could be removed here, since most ops modify...
This is a pure rewrite of the functions that were already there, reducing the number of arguments that get passed to each function to the necessary ones (and no `self`)....
This branch ``` codegen mean runtime: 134.16ms, runs: 152.20, 132.09, 163.07, 124.28, 124.31, 131.00, 129.97, 129.84, 127.28, 127.59 methodcache mean runtime: 127.37ms, runs: 132.33, 168.70, 123.20, 122.04, 120.86, 120.40, 120.13,...