chenyu
removing that fixed the llama shard weight double copy. it was from #3966, but I think we fixed replace after that
testing... cut search time by half, no perf diff in llama and gpt2
bisected to this run: https://github.com/tinygrad/tinygrad/actions/runs/9435797854/job/25995095374 (this one is fine: https://github.com/tinygrad/tinygrad/actions/runs/9435244478/job/25988401090)
one very slow kernel in train_gpt2: `BEAM=2 DEBUG=2 python3 test/external/external_test_lm_head.py` gives 136ms. changing vocab_size to 50304 brings the kernel down to 20ms. fixing this in the optimizer could bring the training step under 300ms...
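for reference, a rough standalone timing sketch of the effect (not the actual external_test_lm_head.py; the shapes and timing harness here are assumptions), presumably faster because 50304 is a multiple of 64 so the output dim tiles evenly:

```python
# a rough timing sketch, not the actual external_test_lm_head.py: compare the
# lm_head matmul with the raw GPT-2 vocab (50257) against the padded vocab
# (50304, a multiple of 64). shapes here are assumptions for illustration.
import time
from tinygrad import Tensor, Device

def time_lm_head(vocab_size: int, tokens: int = 1024, dim: int = 768) -> float:
  x = Tensor.rand(tokens, dim).realize()
  w = Tensor.rand(dim, vocab_size).realize()
  (x @ w).realize()                         # warm up: compile (and BEAM-search) the kernel
  Device[Device.DEFAULT].synchronize()
  st = time.perf_counter()
  (x @ w).realize()
  Device[Device.DEFAULT].synchronize()
  return (time.perf_counter() - st) * 1e3

for v in (50257, 50304):
  print(f"vocab_size={v}: {time_lm_head(v):.1f} ms")
```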
#5051, but why?
similar to conv pool, the shapetracker became weird when elements from the first row became part of the second row
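a minimal illustration of that kind of row wrap on a ShapeTracker directly (assumed shapes, not the kernel from the issue; import path assumes the current tree layout):

```python
# a minimal illustration, not the failing kernel: after a conv-pool style pad,
# a reshape whose rows cut across the padded rows can't be expressed as one
# strided+masked view, so the ShapeTracker has to keep a second view.
from tinygrad.shape.shapetracker import ShapeTracker

st = ShapeTracker.from_shape((2, 3))   # contiguous 2x3: [[0,1,2],[3,4,5]]
st = st.pad(((0, 0), (0, 1)))          # 2x4 with a masked zero column, like conv pool padding
print(len(st.views))                   # still 1 view: strides + mask
st = st.reshape((4, 2))                # element 2 from the first row now sits in the second row
print(len(st.views))                   # 2 views: the reshape couldn't merge into the padded view
```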
Added `.val` for all sint Nodes and used that to get int_size. Also, reshape_and_permute shares the same code path, but the Variables are no longer bound, so they passed the check...
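for context, a tiny bound-vs-unbound example against the symbolic API (illustration only, not the actual check):

```python
# a tiny illustration of bound vs unbound Variables (not the actual check):
# only a bound Variable carries a concrete .val; an unbound one only knows
# its [min, max] range, and accessing .val on it asserts.
from tinygrad.shape.symbolic import Variable

v = Variable("i", 1, 10)   # unbound symbolic dim: only the range [1, 10] is known
b = v.bind(4)              # bound: same range, plus a concrete value
print(b.val)               # -> 4
```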
change MOD range to take any positive UOp
wip: this matches symbolic and should be able to be folded into the existing patterns
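for reference, the kind of range-based mod folding this targets, shown on the symbolic Node API it's said to match (illustration only; the UOp rewrite itself isn't shown):

```python
# the range-based mod folding this pattern is about, shown on the symbolic
# Node API it is said to match (illustration only, not the UOp rewrite itself)
from tinygrad.shape.symbolic import Variable

i, j = Variable("i", 0, 9), Variable("j", 0, 9)
print(((i * 10 + j) % 10).render())   # expected to fold to "j": i*10 is a multiple of 10
print((j % 10).render())              # expected to fold to "j": j's range already fits in [0, 10)
```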
default is dim and keepdim like torch; axis / keepdims also work
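a rough sketch of that kind of kwarg aliasing (illustrative only, not the actual tinygrad signature or the function this landed on):

```python
# a rough sketch of the kwarg aliasing described above (illustrative only, not
# the actual tinygrad signature): torch-style dim/keepdim is the default
# spelling, numpy-style axis/keepdims is accepted as an alias.
from tinygrad import Tensor

def reduce_like_torch(x: Tensor, dim=None, keepdim=False, *, axis=None, keepdims=None):
  if axis is not None: dim = axis              # numpy-style alias
  if keepdims is not None: keepdim = keepdims
  return x.sum(axis=dim, keepdim=keepdim)      # Tensor.sum itself takes axis/keepdim

t = Tensor.arange(6).reshape(2, 3)
print(reduce_like_torch(t, dim=1).numpy())                   # torch-style kwargs
print(reduce_like_torch(t, axis=1, keepdims=True).numpy())   # numpy-style kwargs
```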