nimlgen
Some simple kernels work, but it gets stuck after running 3-4 of them as of now. Will investigate. Yeah, might put the ioctl work into a separate PR; also want to include a FIFO dumper in it.
No, I haven't spent much time on multi-GPU yet, just basic p2p experiments mapping CPU pages to both GPUs, but it's slower than CUDA's transfer. Want to fix and clean up one...
As of now all tests pass for a single GPU. Cleaned up and refactored into queues. Still want to clean up the nv imports somehow, and memory management (while refactoring for multi-GPU).
Should be good to merge. Tests and models look good.
For a simple case like `(x@y)+(a@b)` we have this graph: [original graph image] which gets turned into this one: [fused graph image]. So, instead of 2 kernels (the 1st just a reduce, the 2nd a reduce + binary ops), we get 1 kernel with...
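For reference, a minimal repro sketch of that case (assuming the top-level `Tensor` import; the shapes are just illustrative):

```python
from tinygrad import Tensor

x, y = Tensor.rand(64, 64), Tensor.rand(64, 64)
a, b = Tensor.rand(64, 64), Tensor.rand(64, 64)

# two matmul reduces feeding a single binary add
out = (x @ y) + (a @ b)
# with the fusion above this schedules as one fused kernel
# instead of separate reduce and reduce+binary-op kernels
out.realize()
```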
While working on reduces I also got some major speedups on these tests (https://github.com/tinygrad/tinygrad/pull/1137). I decided to go with atomics to speed up the slow reduces. I just got 80x slower to...
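For intuition, here is a toy CPU-side model of the atomic approach (the split and all names are illustrative, not tinygrad's actual codegen): each "workgroup" reduces its own slice locally, then atomically accumulates its partial sum into the single output cell, so the second reduce kernel goes away.

```python
import numpy as np
from threading import Lock
from concurrent.futures import ThreadPoolExecutor

# Toy model only: N_GROUPS "workgroups" each reduce a slice, then accumulate.
N_GROUPS = 128
x = np.random.rand(1 << 20).astype(np.float32)
out, lock = np.zeros(1, dtype=np.float32), Lock()

def workgroup(chunk):
    partial = chunk.sum()   # local reduce inside the "workgroup"
    with lock:              # stands in for atomic_add on the GPU
        out[0] += partial

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(workgroup, np.array_split(x, N_GROUPS)))

assert np.isclose(out[0], x.sum(), rtol=1e-3)
```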
The main problem with the generated reduces is that global_dim=1 and local_dim=256 (from test_sum). That is just 256 threads (8 warps) running on 1 SM, which simply can't saturate the...
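Rough back-of-the-envelope on why that launch can't keep the GPU busy (assuming an A100-class part with 108 SMs and up to 2048 resident threads per SM; the exact figures are only illustrative):

```python
sms, threads_per_sm = 108, 2048                     # illustrative A100-class numbers
launched = 1 * 256                                  # global_dim=1, local_dim=256
print(f"{launched / (sms * threads_per_sm):.2%}")   # ~0.12% of the machine's thread capacity
```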
> with this change, the primary generated kernel uses global_dim=128 and local_dim=32, for example see one of the generated kernels below.

Hmm, this kernel is still not very efficient. It...
Thanks for testing. Should be fixed now (would appreciate it if you could check on your end as well). LMK if any other refactors are needed.
I am seeing the same output for LLVM, CLANG, and CUDA (both async and synced memcpy). What command to repro this?

```
nimlgen@tiny15:~/tinygrad$ CLANG=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch,...
```