Rayan Hatout
## Update

First sighting of sub-200ms on M1. I think I've exhausted all the cheap tricks; getting the next 100ms down is all about some actual refactoring.

```
loaded...
```
@geohot I'm down to ~180ms on my M1 (running with DEBUG=0, OPTLOCAL=1, and the profiler disabled); curious to see if this latest rev makes a significant difference on your machine
Is it fair game to disable assertions for this bounty (i.e. running with `python -O`)? I'm now down to `130ms` on my M1 Pro (theoretical lower bound is ~`70ms`) with...
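For anyone unfamiliar with the flag: `python -O` sets `__debug__` to `False` and strips `assert` statements at compile time, so their conditions are never evaluated at runtime. A minimal illustration of why this matters on a hot path (the function and check here are hypothetical, not from the tinygrad codebase):

```
def check_shape(shape):
    # under `python -O` this assert is compiled out entirely,
    # so the all(...) scan never runs on the hot path
    assert all(dim > 0 for dim in shape), "shape must be positive"
    return shape

print(__debug__)  # True normally, False under `python -O`
```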
Haha, trust me, I've been staring at `movement_op` for a very, very long time. Good to know you're fine with the API being changed; I've been avoiding completely gutting the thing...
Down another 10ms (~110ms/token) on my machine. We can probably get rid of a significant amount of overhead by having a dedicated method for each of the `MovementOps` in the `LazyBuffer` class...
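A toy sketch of that idea, assuming a generic `movement_op` entry point that dispatches on an enum (the semantics here are simplified to shape bookkeeping; this is not tinygrad's actual code):

```
from enum import Enum, auto

class MovementOps(Enum):
    RESHAPE = auto()
    PERMUTE = auto()
    EXPAND = auto()

class LazyBuffer:
    """Toy stand-in for tinygrad's LazyBuffer; only tracks a shape."""
    def __init__(self, shape):
        self.shape = shape

    # generic entry point: every call pays enum dispatch + branching
    def movement_op(self, op, arg):
        if op is MovementOps.RESHAPE:
            return LazyBuffer(arg)
        if op is MovementOps.PERMUTE:
            return LazyBuffer(tuple(self.shape[i] for i in arg))
        if op is MovementOps.EXPAND:
            return LazyBuffer(arg)
        raise NotImplementedError(op)

    # dedicated methods skip the dispatch entirely
    def reshape(self, new_shape):
        return LazyBuffer(new_shape)

    def permute(self, order):
        return LazyBuffer(tuple(self.shape[i] for i in order))

    def expand(self, new_shape):
        return LazyBuffer(new_shape)
```

With per-op methods, `LazyBuffer((3, 4)).permute((1, 0))` goes straight to the right code path instead of funneling every movement through one branching dispatcher.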
```
def test_fold_conv_batchnorm_sgd(self):
    # TODO: with Tensor.training
    Tensor.training = True
    img = Tensor.ones(1,3,4,4)
    c1 = nn.Conv2d(3,32,3)
    bn = nn.BatchNorm2d(32, track_running_stats=False)
    opt = optim.SGD(optim.get_parameters([c1, bn]))
    with CLCache(allowed=18):  # this is too...
```
Uhh, so I just noticed I'm on Python 3.11 and you're on 3.10, which is probably part of the reason why my local benchmark is way faster than yours. On 3.10 I'm...
The next most obvious thing I see is breaking the ref cycle of `LazyBuffer -> LazyOp.get_buffers().children -> LazyBuffer` so we don't have to pay the price of instantiating a WeakSet...
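A rough sketch of the trade-off, under the assumption that the `children` back-references are what force the weak container (class names here are illustrative, not tinygrad's): if the cycle can't form, a plain `set` suffices, and the per-node instantiation cost can be measured directly.

```
import timeit
import weakref

class BufferWeak:
    # current style: children held in a WeakSet so the
    # parent -> child -> parent cycle doesn't keep buffers alive;
    # this pays for a WeakSet on every instantiation
    def __init__(self):
        self.children = weakref.WeakSet()

class BufferPlain:
    # if the back-reference is removed so the cycle can't form,
    # a plain set (or no container at all) suffices
    def __init__(self):
        self.children = set()

print("WeakSet:", timeit.timeit(weakref.WeakSet, number=1_000_000))
print("set:    ", timeit.timeit(set, number=1_000_000))
```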
Generators are way slower in almost all the places where we currently use them: the iterables are quite small, so the fixed generator overhead dominates.
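An illustrative micro-benchmark of that effect (the workload is made up; the point is the small input size typical of shape/stride tuples on this path):

```
import timeit

xs = list(range(8))  # short sequence, like most shapes/strides here

def with_genexpr(x):
    return sum(v * v for v in x)    # generator frame setup + per-item resume

def with_listcomp(x):
    return sum([v * v for v in x])  # eager list build, then summed

t_gen = timeit.timeit(lambda: with_genexpr(xs), number=1_000_000)
t_list = timeit.timeit(lambda: with_listcomp(xs), number=1_000_000)
print(f"genexpr: {t_gen:.3f}s  listcomp: {t_list:.3f}s")
```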
`x[::-1]` vs. `reversed`

```
import timeit

X = [1, 2, 3, 4, 5, 6, 7, 8]

def f1(x): return sum(reversed(x))
def f2(x): return sum(x[::-1])

t1 = timeit.timeit(lambda: f1(X), number=1000000)
t2 = timeit.timeit(lambda: f2(X), number=1000000)
```