chenyu
#3271 and beam searching resnet for example
- [x] assert if index > int32 #4157
- [ ] fix linearizer and check index max, use int64 if needed. assert if...
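A minimal sketch of the index-width check described above, assuming a hypothetical helper that picks the index dtype from the buffer's element count (names are illustrative, not tinygrad's actual linearizer API):

```python
import math

INT32_MAX = 2**31 - 1

def index_dtype_for(shape):
    """Pick an index dtype wide enough to address every element.

    Returns "int64" when the largest flat index exceeds int32 range,
    otherwise "int32". Asserts (raises) if even int64 would overflow.
    """
    numel = math.prod(shape)
    if numel - 1 > 2**63 - 1:
        raise OverflowError(f"index overflows int64 for shape {shape}")
    return "int64" if numel - 1 > INT32_MAX else "int32"

print(index_dtype_for((256, 256)))      # small buffer, int32 suffices
print(index_dtype_for((65536, 65536)))  # 2**32 elements, needs int64
```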
previously needed due to kernel buffer count limit, seems fine to remove now
This matches rsqrt backward at 0 to torch. Also added exact tests for reciprocal, sqrt, and rsqrt. See https://pytorch.org/docs/stable/notes/autograd.html#gradients-for-non-differentiable-functions for how torch defines derivatives at these points.
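For reference, a pure-Python finite-difference check of the rsqrt derivative away from the singularity (a sketch only; the value torch assigns at the non-differentiable point x=0 is defined by the linked note, not by this code):

```python
import math

def rsqrt(x):
    return 1.0 / math.sqrt(x)

# analytic derivative: d/dx x**-0.5 = -0.5 * x**-1.5
def rsqrt_grad(x):
    return -0.5 * x**-1.5

# central finite difference at x=4, well away from x=0
x, h = 4.0, 1e-6
numeric = (rsqrt(x + h) - rsqrt(x - h)) / (2 * h)
print(numeric, rsqrt_grad(x))  # both ≈ -0.0625 at x=4

# as x -> 0+, rsqrt_grad(x) -> -inf, which is why the backward
# value at exactly 0 needs an explicit convention
```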
example usage from discord

```
for t in range(0, inp_seq_len):
    model(X[:, t].to(device))
y_pred = torch.zeros(Y.size()).to(device)
for i in range(0, out_seq_len):
    y_pred[:, i] = model()
loss = criterion(y_pred.to(device), Y.to(device))
loss.backward()
clip_grads(model)
...
```
targeting early June
- [ ] device test plans
- [ ] onboarding
- [ ] LLM 8B 200+ tok/s
- [ ] 70B llama works
- [ ] mixtral...
targeting end of May
- [ ] docs
- [ ] setitem #4574
- [ ] `pip install tinygrad` well tested
- [ ] all tests pass locally on tinyboxes...
example with rewriting `sparse_categorical_crossentropy` using getitem

```
from tinygrad import Tensor, GlobalCounters

X = Tensor.rand(256, 1000).realize()
Y = Tensor.randint(256, low=0, high=10).realize()
GlobalCounters.reset()
X.sparse_categorical_crossentropy(Y, label_smoothing=0.1).realize()
print(f"{GlobalCounters.global_ops=}, {GlobalCounters.global_mem=}, {GlobalCounters.kernel_count=}")

def scc2(self, Y:Tensor,...
```
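The getitem rewrite replaces the one-hot multiply with direct indexing of the true-class log-probability. A pure-Python sketch of why the two formulations are equivalent (toy numbers, no label smoothing; function names are illustrative, not tinygrad's):

```python
def sparse_ce_onehot(logprobs, labels):
    """Cross entropy via a one-hot dot product (the naive formulation)."""
    total = 0.0
    for row, y in zip(logprobs, labels):
        onehot = [1.0 if j == y else 0.0 for j in range(len(row))]
        total += -sum(o * lp for o, lp in zip(onehot, row))
    return total / len(labels)

def sparse_ce_getitem(logprobs, labels):
    """Same loss, but index the true-class log-prob directly (getitem)."""
    return sum(-row[y] for row, y in zip(logprobs, labels)) / len(labels)

# toy batch: 2 samples, 3 classes, rows are already log-softmaxed
lp = [[-0.1, -2.3, -3.0], [-1.5, -0.4, -2.0]]
ys = [0, 1]
print(sparse_ce_onehot(lp, ys), sparse_ce_getitem(lp, ys))  # both print 0.25
```

The getitem form skips materializing the one-hot tensor entirely, which is where the op and memory savings come from.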
should be able to debug nan loss and inf output
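A hypothetical pure-Python helper in the spirit of that debugging goal, scanning outputs for the offending values (illustrative only; not a tinygrad API):

```python
import math

def find_bad_values(name, values):
    """Scan a flat list of floats and report any nan or inf entries."""
    bad = [(i, v) for i, v in enumerate(values)
           if math.isnan(v) or math.isinf(v)]
    for i, v in bad:
        print(f"{name}[{i}] = {v}")
    return bad

# a toy "output" with one inf and one nan to flag
find_bad_values("output", [0.5, float("inf"), float("nan"), 1.0])
```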
```
FAILED test/test_ops.py::TestOps::test_fancy_conv2d - AssertionError: invalid shrink ((0, 2), (0, 3), (0, 9), (0, 26)) for (2, 3, 8, 28)
FAILED test/test_ops.py::TestOps::test_nested_conv2d - Exception: backward pass tensor 0 failed shape...
```
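The "invalid shrink" assertion fires when a shrink range exceeds an axis bound. A minimal sketch of the bounds check (illustrative, not tinygrad's actual implementation):

```python
def valid_shrink(arg, shape):
    """A shrink ((b, e), ...) is valid iff 0 <= b <= e <= dim per axis."""
    return len(arg) == len(shape) and all(
        0 <= b <= e <= dim for (b, e), dim in zip(arg, shape))

# the failing case from the log: axis 2 asks for (0, 9) on a dim of size 8
print(valid_shrink(((0, 2), (0, 3), (0, 9), (0, 26)), (2, 3, 8, 28)))  # False
print(valid_shrink(((0, 2), (0, 3), (0, 8), (0, 26)), (2, 3, 8, 28)))  # True
```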
repro

```
from tinygrad import Tensor, TinyJit, Variable

@TinyJit
def f(t): return t + 1

for i in range(1, 5):
    vi = Variable("i", 1, 10).bind(i)
    t = f(Tensor.rand(i).reshape(vi))
    print(f"{t.shape=}")
```
...