chenyu
removing that fixed the llama shard weight double copy. it was from #3966, but I think we fixed replace after that
testing... cut search time by half, no perf diff in llama and gpt2
bisected to this run: https://github.com/tinygrad/tinygrad/actions/runs/9435797854/job/25995095374 (this one is fine: https://github.com/tinygrad/tinygrad/actions/runs/9435244478/job/25988401090)
one very slow kernel in train_gpt2: `BEAM=2 DEBUG=2 python3 test/external/external_test_lm_head.py` gives 136ms. changing vocab_size to 50304 brings the kernel down to 20ms. fixing this in the optimizer could bring the training step under 300ms...
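for reference, a rough standalone timing sketch of the effect (not the actual external_test_lm_head.py; the shapes and timing harness here are assumptions), presumably faster because 50304 is a multiple of 64 so the output dim tiles evenly:

```python
# a rough timing sketch, not the actual external_test_lm_head.py: compare the
# lm_head matmul with the raw GPT-2 vocab (50257) against the padded vocab
# (50304, a multiple of 64). shapes here are assumptions for illustration.
import time
from tinygrad import Tensor, Device

def time_lm_head(vocab_size: int, tokens: int = 1024, dim: int = 768) -> float:
  x = Tensor.rand(tokens, dim).realize()
  w = Tensor.rand(dim, vocab_size).realize()
  (x @ w).realize()                         # warm up: compile (and BEAM-search) the kernel
  Device[Device.DEFAULT].synchronize()
  st = time.perf_counter()
  (x @ w).realize()
  Device[Device.DEFAULT].synchronize()
  return (time.perf_counter() - st) * 1e3

for v in (50257, 50304):
  print(f"vocab_size={v}: {time_lm_head(v):.1f} ms")
```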
#5051, but why?
similar to conv pool, the shapetracker became weird when elements from the first row became part of the second row
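a minimal illustration of that kind of row wrap on a ShapeTracker directly (assumed shapes, not the kernel from the issue; import path assumes the current tree layout):

```python
# a minimal illustration, not the failing kernel: after a conv-pool style pad,
# a reshape whose rows cut across the padded rows can't be expressed as one
# strided+masked view, so the ShapeTracker has to keep a second view.
from tinygrad.shape.shapetracker import ShapeTracker

st = ShapeTracker.from_shape((2, 3))   # contiguous 2x3: [[0,1,2],[3,4,5]]
st = st.pad(((0, 0), (0, 1)))          # 2x4 with a masked zero column, like conv pool padding
print(len(st.views))                   # still 1 view: strides + mask
st = st.reshape((4, 2))                # element 2 from the first row now sits in the second row
print(len(st.views))                   # 2 views: the reshape couldn't merge into the padded view
```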
Added `.val` for all sint Nodes and used that to get int_size. Also, reshape_and_permute shares the same code path, but the Variables are no longer bound, so they passed the check...
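for context, a tiny bound-vs-unbound example against the symbolic API (illustration only, not the actual check):

```python
# a tiny illustration of bound vs unbound Variables (not the actual check):
# only a bound Variable carries a concrete .val; an unbound one only knows
# its [min, max] range, and accessing .val on it asserts.
from tinygrad.shape.symbolic import Variable

v = Variable("i", 1, 10)   # unbound symbolic dim: only the range [1, 10] is known
b = v.bind(4)              # bound: same range, plus a concrete value
print(b.val)               # -> 4
```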
change MOD range to take any positive UOp
wip: this matches symbolic and should be able to be folded into the existing patterns
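for reference, the kind of range-based mod folding this targets, shown on the symbolic Node API it's said to match (illustration only; the UOp rewrite itself isn't shown):

```python
# the range-based mod folding this pattern is about, shown on the symbolic
# Node API it is said to match (illustration only, not the UOp rewrite itself)
from tinygrad.shape.symbolic import Variable

i, j = Variable("i", 0, 9), Variable("j", 0, 9)
print(((i * 10 + j) % 10).render())   # expected to fold to "j": i*10 is a multiple of 10
print((j % 10).render())              # expected to fold to "j": j's range already fits in [0, 10)
```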
default is dim and keepdim like torch; axis / keepdims also work
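a rough sketch of that kind of kwarg aliasing (illustrative only, not the actual tinygrad signature or the function this landed on):

```python
# a rough sketch of the kwarg aliasing described above (illustrative only, not
# the actual tinygrad signature): torch-style dim/keepdim is the default
# spelling, numpy-style axis/keepdims is accepted as an alias.
from tinygrad import Tensor

def reduce_like_torch(x: Tensor, dim=None, keepdim=False, *, axis=None, keepdims=None):
  if axis is not None: dim = axis              # numpy-style alias
  if keepdims is not None: keepdim = keepdims
  return x.sum(axis=dim, keepdim=keepdim)      # Tensor.sum itself takes axis/keepdim

t = Tensor.arange(6).reshape(2, 3)
print(reduce_like_torch(t, dim=1).numpy())                   # torch-style kwargs
print(reduce_like_torch(t, axis=1, keepdims=True).numpy())   # numpy-style kwargs
```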