Kunwar Raj Singh
1. Yes, limiting the number of buffers passed directly would solve the compilation issue. 2. Some form of early realizing would still be needed, as this simple code takes indefinitely long...
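Roughly what I mean by early realizing (a hypothetical sketch, not the repro above): flush the lazy graph every few iterations so the buffer/kernel count stays bounded instead of growing with the whole loop.

```
from tinygrad.tensor import Tensor

x = Tensor.rand(64, 64)
for i in range(1000):
    x = x + 1
    # hypothetical early realize: materialize every so often so the graph
    # doesn't accumulate the entire loop before a single compile
    if i % 100 == 99:
        x.realize()
```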
The masked_index function in _bilinear_interpolate (https://github.com/geohot/tinygrad/pull/884) is blocked by this.
By converting everything into 1-D, this can be achieved:
```
import numpy as np
from tinygrad.tensor import Tensor, dtypes
input_ = Tensor.rand(2, 4, 6, 8)
a = Tensor.uniform(*(2, 1, 1,...
```
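To illustrate just the 1-D conversion (a numpy-only sketch with hypothetical indices, not the actual PR code): a multi-dimensional index (n, c, y, x) collapses into one offset into the flattened input.

```
import numpy as np

inp = np.random.rand(2, 4, 6, 8)
N, C, H, W = inp.shape
n, c, y, x = 1, 2, 3, 4            # hypothetical per-dimension indices

# collapse (n, c, y, x) into a single flat index into inp.reshape(-1)
flat_idx = ((n * C + c) * H + y) * W + x
assert inp.reshape(-1)[flat_idx] == inp[n, c, y, x]
```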
With OneCycle LR scheduling and weight decay: jump from 82.79% --> 87.94%. With TTA FLIP and training-time FLIP: jump from 87.94% --> 90.01%.
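For reference, the OneCycle shape I mean is roughly this (a hypothetical sketch with made-up parameters, not the exact schedule used here): linear warmup to a peak LR, then anneal back down.

```
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, div=10.0, final_div=100.0):
    # warmup phase: linearly ramp from max_lr/div up to max_lr
    warmup = int(total_steps * pct_start)
    if step < warmup:
        return max_lr / div + (max_lr - max_lr / div) * step / warmup
    # anneal phase: cosine decay from max_lr down to max_lr/final_div
    t = (step - warmup) / max(1, total_steps - warmup)
    return max_lr / final_div + (max_lr - max_lr / final_div) * 0.5 * (1 + math.cos(math.pi * t))
```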
@geohot I've just crossed 90%, though some tweaking of hyperparams could probably lead to higher acc
@geohot so it's not that eval is slower; the previous eval was just on a single batch and I changed it to run on the entire 10k test set (which will be...
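Roughly what the full-test-set eval does now (a hypothetical sketch; model, X_test, Y_test are placeholders and the model is assumed to return a numpy array of class scores):

```
def evaluate(model, X_test, Y_test, bs=500):
    # run the whole 10k test set in batches and count correct predictions
    correct = 0
    for i in range(0, len(X_test), bs):
        scores = model(X_test[i:i + bs])      # assumed numpy array, shape (bs, num_classes)
        preds = scores.argmax(axis=-1)
        correct += int((preds == Y_test[i:i + bs]).sum())
    return correct / len(X_test)
```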
`eval 9008/10000 90.08%, 0.41 val_loss STEP=2000` BTW @geohot, noticed that it says STEP=2000; if you are running with BS=256 and STEPS=2000, it's not needed, the default hyperparameters set in train_cifar perform...
Also, I noticed that removing JIT completely doesn't have a drastic performance hit; on my 3060 Mobile I get ~1.4 TFLOPS with JIT (full JIT and current hacked JIT give...
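For context, this is roughly how I'm comparing the numbers (a hypothetical sketch; step_fn and flops_per_step are placeholders for the jitted/unjitted train step and its estimated FLOP count):

```
import time

def measure_tflops(step_fn, flops_per_step, iters=20, warmup=3):
    # time a batch of steps (after a short warmup) and convert to TFLOPS
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return flops_per_step * iters / elapsed / 1e12
```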
@geohot should be ready now! The training step is fully jitted, and I added mixup to get a small boost in accuracy, because it didn't need many lines. Tried label smoothing...
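For reference, mixup is roughly this (a hypothetical numpy sketch with one-hot labels assumed, not the exact training code):

```
import numpy as np

def mixup(X, Y, alpha=0.2, rng=None):
    # blend each image (and its one-hot label) with a randomly chosen partner
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[perm], lam * Y + (1 - lam) * Y[perm]
```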
> I don't think gradient clipping is the right fix. Is that how torch is doing it?

Agree, after gradient clipping there are no NaNs but the training diverges. Looking...
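For reference, a common form of gradient clipping is clipping by global norm; a hypothetical numpy sketch (not necessarily what was tried in the PR):

```
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # scale all gradients by a common factor so their joint L2 norm <= max_norm
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]
```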