nimlgen
Some simple kernels work, but it gets stuck after running 3-4 of them as of now. Will investigate. Yeah, might put the ioctl work into a separate PR; also want to include a FIFO dumper in it.
No, I haven't spent much time on multi-GPU yet, just basic p2p experiments mapping CPU pages to both GPUs, but it's slower than CUDA's transfer. Want to fix and clean up one...
As of now all tests pass for a single GPU. Cleaned up and refactored into queues. Still want to clean up the nv imports somehow, and memory management (while refactoring for multi-GPU).
Should be good to merge. Tests and models look good.
For a simple case like `(x@y)+(a@b)` we have this graph: [original graph image] which gets turned into this one: [fused graph image]. So, instead of 2 kernels (the 1st just a reduce, the 2nd a reduce + binary ops), we get 1 kernel with...
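For reference, a minimal repro sketch of that case (assuming the top-level `Tensor` import; the shapes are just illustrative):

```python
from tinygrad import Tensor

x, y = Tensor.rand(64, 64), Tensor.rand(64, 64)
a, b = Tensor.rand(64, 64), Tensor.rand(64, 64)

# two matmul reduces feeding a single binary add
out = (x @ y) + (a @ b)
# with the fusion above this schedules as one fused kernel
# instead of separate reduce and reduce+binary-op kernels
out.realize()
```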
While working on reduces I also got some major speedups on these tests (https://github.com/tinygrad/tinygrad/pull/1137). I decided to go with atomics to speed up the slow reduces. I just got 80x slower to...
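For intuition, here is a toy CPU-side model of the atomic approach (the split and all names are illustrative, not tinygrad's actual codegen): each "workgroup" reduces its own slice locally, then atomically accumulates its partial sum into the single output cell, so the second reduce kernel goes away.

```python
import numpy as np
from threading import Lock
from concurrent.futures import ThreadPoolExecutor

# Toy model only: N_GROUPS "workgroups" each reduce a slice, then accumulate.
N_GROUPS = 128
x = np.random.rand(1 << 20).astype(np.float32)
out, lock = np.zeros(1, dtype=np.float32), Lock()

def workgroup(chunk):
    partial = chunk.sum()   # local reduce inside the "workgroup"
    with lock:              # stands in for atomic_add on the GPU
        out[0] += partial

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(workgroup, np.array_split(x, N_GROUPS)))

assert np.isclose(out[0], x.sum(), rtol=1e-3)
```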
The main problem with the generated reduces is that global_dim=1 and local_dim=256 (from test_sum). That is just 256 threads (8 warps) running on 1 SM, which simply can't saturate the...
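Rough back-of-the-envelope on why that launch can't keep the GPU busy (assuming an A100-class part with 108 SMs and up to 2048 resident threads per SM; the exact figures are only illustrative):

```python
sms, threads_per_sm = 108, 2048                     # illustrative A100-class numbers
launched = 1 * 256                                  # global_dim=1, local_dim=256
print(f"{launched / (sms * threads_per_sm):.2%}")   # ~0.12% of the machine's thread capacity
```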
> with this change, the primary generated kernel uses global_dim=128 and local_dim=32, for example see one of the generated kernels below.

Hmm, this kernel is still not very efficient. It...
Thanks for testing. Should be fixed now (would appreciate it if you could check on your end as well). LMK if any other refactors are needed.
I am seeing the same output for LLVM, CLANG, and CUDA (both async and synced memcpy). What command to repro this?

```
nimlgen@tiny15:~/tinygrad$ CLANG=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch,...
```