chenyu

80 issues by chenyu

Removing that fixed the llama shard weight double copy. It was from #3966, but I think we fixed `replace` after that.

Testing... this cut search time by half, with no perf diff in llama or gpt2.

Bisected to this run: https://github.com/tinygrad/tinygrad/actions/runs/9435797854/job/25995095374 while this one is fine: https://github.com/tinygrad/tinygrad/actions/runs/9435244478/job/25988401090

One very slow kernel in train_gpt2: `BEAM=2 DEBUG=2 python3 test/external/external_test_lm_head.py` gives 136ms. Changing vocab_size to 50304 brings the kernel down to 20ms. Fixing this in optimization could get the training step under 300ms.
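
As a rough sketch of the fix being suggested, assuming it amounts to padding the vocab dimension up to a multiple of 64 and slicing the logits back (the `LMHead` class and constants here are illustrative, not tinygrad's actual GPT-2 model code):

```python
# hypothetical sketch: pad the LM head's output dimension to a rounder size,
# then slice the padding columns off the logits
from tinygrad import Tensor
from tinygrad.nn import Linear

VOCAB_SIZE = 50257                           # real GPT-2 vocab size
PADDED_VOCAB = (VOCAB_SIZE + 63) // 64 * 64  # 50304, next multiple of 64

class LMHead:
  def __init__(self, dim:int):
    self.proj = Linear(dim, PADDED_VOCAB, bias=False)  # padded projection
  def __call__(self, x:Tensor) -> Tensor:
    return self.proj(x)[..., :VOCAB_SIZE]              # drop the padding
```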

Similar to conv/pool, the shapetracker became weird when elements in the first row also became part of the second row (see the sketch below).
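
For context, a tiny numpy illustration (not the shapetracker itself) of the kind of overlapping layout meant here, where data from the first output row shows up again in the second:

```python
# illustrative only: pooling-style windows overlap, so elements of the first
# output row reappear in the second; the shapetracker has to express this
# reuse with strides/masks rather than a plain reshape
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(6)               # [0 1 2 3 4 5]
w = sliding_window_view(x, 3)  # size-3 windows, stride 1
print(w)
# [[0 1 2]
#  [1 2 3]   <- 1 and 2 from the first row are also in the second row
#  [2 3 4]
#  [3 4 5]]
```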

Added `.val` for all sint Nodes and used that to get int_size. Also, reshape_and_permute shares the same code path, but the Variables are no longer bound there, so they passed the check...
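
A rough illustration of the `.val` idea as described, with made-up classes rather than tinygrad's real symbolic Nodes:

```python
# hypothetical sketch, not the real Node hierarchy: a sint is either a plain
# int or a Node, and .val reads the concrete value when one is bound
from typing import Union

class Node:
  def __init__(self, value=None): self._value = value
  @property
  def val(self) -> int:
    assert self._value is not None, "unbound Variable has no concrete value"
    return self._value

sint = Union[int, Node]

def int_size(s: sint) -> int:
  # read the concrete value uniformly, whether s is an int or a bound Node
  return s.val if isinstance(s, Node) else s

print(int_size(4))        # plain int
print(int_size(Node(7)))  # bound Node
```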

change MOD range to take any positive UOp
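
A hedged sketch of the range rule this refers to, assuming Python-style modulo by a strictly positive divisor (the helper name and interface are made up for illustration):

```python
# illustrative MOD range reasoning, not tinygrad code: for a divisor d > 0,
# x % d always lands in [0, d-1]; if x already lies in [0, d) the mod is a
# no-op and the tighter range can be kept
def mod_range(x_min:int, x_max:int, d_min:int, d_max:int) -> tuple[int, int]:
  assert d_min > 0, "divisor must be strictly positive"
  if 0 <= x_min and x_max < d_min:
    return (x_min, x_max)   # mod changes nothing on this range
  return (0, d_max - 1)     # generic bound for a positive divisor

print(mod_range(0, 5, 8, 8))    # (0, 5)
print(mod_range(0, 100, 8, 8))  # (0, 7)
```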

WIP: this matches symbolic and should be able to be included in the existing patterns.

Defaults are dim and keepdim like torch, and axis / keepdims also work.
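
A small illustrative sketch of that argument handling, using plain numpy and a hypothetical helper name: torch-style dim/keepdim are the primary parameters, and numpy-style axis/keepdims are mapped onto them:

```python
# hypothetical helper, for illustration only: torch-style dim/keepdim are the
# defaults, numpy-style axis/keepdims aliases still work
import numpy as np

def sum_reduce(x:np.ndarray, dim=None, keepdim=False, **kwargs) -> np.ndarray:
  if "axis" in kwargs: dim = kwargs.pop("axis")              # numpy alias
  if "keepdims" in kwargs: keepdim = kwargs.pop("keepdims")  # numpy alias
  assert not kwargs, f"unexpected keyword arguments: {list(kwargs)}"
  return np.sum(x, axis=dim, keepdims=keepdim)

x = np.arange(6).reshape(2, 3)
print(sum_reduce(x, dim=1))                  # torch-style
print(sum_reduce(x, axis=1, keepdims=True))  # numpy-style aliases
```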