Kunwar Raj Singh
1. Yes, limiting the number of buffers passed directly would solve the compilation issue. 2. Some form of early realizing would still be needed, as this simple code takes indefinitely long...
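Roughly what I mean by early realizing (a hypothetical sketch, not the repro above): flush the lazy graph every few iterations so the buffer/kernel count stays bounded instead of growing with the whole loop.

```
from tinygrad.tensor import Tensor

x = Tensor.rand(64, 64)
for i in range(1000):
    x = x + 1
    # hypothetical early realize: materialize every so often so the graph
    # doesn't accumulate the entire loop before a single compile
    if i % 100 == 99:
        x.realize()
```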
The masked_index function in _bilinear_interpolate (https://github.com/geohot/tinygrad/pull/884) is blocked by this.
By converting everything into 1-D, this can be achieved:
```
import numpy as np
from tinygrad.tensor import Tensor, dtypes
input_ = Tensor.rand(2, 4, 6, 8)
a = Tensor.uniform(*(2, 1, 1,...
```
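To illustrate just the 1-D conversion (a numpy-only sketch with hypothetical indices, not the actual PR code): a multi-dimensional index (n, c, y, x) collapses into one offset into the flattened input.

```
import numpy as np

inp = np.random.rand(2, 4, 6, 8)
N, C, H, W = inp.shape
n, c, y, x = 1, 2, 3, 4            # hypothetical per-dimension indices

# collapse (n, c, y, x) into a single flat index into inp.reshape(-1)
flat_idx = ((n * C + c) * H + y) * W + x
assert inp.reshape(-1)[flat_idx] == inp[n, c, y, x]
```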
With OneCycle LR scheduling and weight decay: jump from 82.79% --> 87.94%. With TTA FLIP and training-time FLIP: jump from 87.94% --> 90.01%.
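For reference, the OneCycle shape I mean is roughly this (a hypothetical sketch with made-up parameters, not the exact schedule used here): linear warmup to a peak LR, then anneal back down.

```
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, div=10.0, final_div=100.0):
    # warmup phase: linearly ramp from max_lr/div up to max_lr
    warmup = int(total_steps * pct_start)
    if step < warmup:
        return max_lr / div + (max_lr - max_lr / div) * step / warmup
    # anneal phase: cosine decay from max_lr down to max_lr/final_div
    t = (step - warmup) / max(1, total_steps - warmup)
    return max_lr / final_div + (max_lr - max_lr / final_div) * 0.5 * (1 + math.cos(math.pi * t))
```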
@geohot I've just crossed 90%, though some tweaking of hyperparams could probably lead to higher acc
@geohot so it's not that eval is slower; the previous eval was just on a single batch and I changed it to run on the entire 10k test set (which will be...
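Roughly what the full-test-set eval does now (a hypothetical sketch; model, X_test, Y_test are placeholders and the model is assumed to return a numpy array of class scores):

```
def evaluate(model, X_test, Y_test, bs=500):
    # run the whole 10k test set in batches and count correct predictions
    correct = 0
    for i in range(0, len(X_test), bs):
        scores = model(X_test[i:i + bs])      # assumed numpy array, shape (bs, num_classes)
        preds = scores.argmax(axis=-1)
        correct += int((preds == Y_test[i:i + bs]).sum())
    return correct / len(X_test)
```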
`eval 9008/10000 90.08%, 0.41 val_loss STEP=2000` BTW @geohot, noticed that it says STEP=2000; if you are running with BS=256 and STEPS=2000, it's not needed, the default hyperparameters set in train_cifar perform...
Also, I noticed that removing JIT completely doesn't have a drastic performance hit; on my 3060 Mobile I get ~1.4 TFLOPS with JIT (full JIT and current hacked JIT give...
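For context, this is roughly how I'm comparing the numbers (a hypothetical sketch; step_fn and flops_per_step are placeholders for the jitted/unjitted train step and its estimated FLOP count):

```
import time

def measure_tflops(step_fn, flops_per_step, iters=20, warmup=3):
    # time a batch of steps (after a short warmup) and convert to TFLOPS
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return flops_per_step * iters / elapsed / 1e12
```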
@geohot should be ready now! The training step is fully jitted, and I added mixup to get a small boost in accuracy, because it didn't need many lines. Tried label smoothing...
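For reference, mixup is roughly this (a hypothetical numpy sketch with one-hot labels assumed, not the exact training code):

```
import numpy as np

def mixup(X, Y, alpha=0.2, rng=None):
    # blend each image (and its one-hot label) with a randomly chosen partner
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[perm], lam * Y + (1 - lam) * Y[perm]
```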
> I don't think gradient clipping is the right fix. Is that how torch is doing it?

Agree, after gradient clipping there are no NaNs but the training diverges. Looking...
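For reference, a common form of gradient clipping is clipping by global norm; a hypothetical numpy sketch (not necessarily what was tried in the PR):

```
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # scale all gradients by a common factor so their joint L2 norm <= max_norm
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]
```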