Szymon Ożóg issues

Results: 19 issues of Szymon Ożóg

ptx cleanup

- removes childless uops
- checking for an item size of 1 is no longer necessary in pointer arithmetic now that we have constant folding
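To illustrate why the item-size check becomes redundant, here is a toy sketch. It is not tinygrad's uop or PTX renderer code: the `Const`, `Var`, `Mul`, `Add` classes and the `fold` and `address` helpers are hypothetical names made up for this example. The point is that a generic constant-folding pass already reduces `index * 1` to `index`, so the address computation needs no special case.

```python
from dataclasses import dataclass

# Hypothetical mini expression tree, for illustration only.
@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def fold(node):
    """Recursively fold constants and strip identity operations."""
    if isinstance(node, (Const, Var)):
        return node
    lhs, rhs = fold(node.lhs), fold(node.rhs)
    if isinstance(node, Mul):
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value * rhs.value)
        if rhs == Const(1):
            return lhs  # x * 1 -> x
        if lhs == Const(1):
            return rhs  # 1 * x -> x
        return Mul(lhs, rhs)
    # Remaining node type in this toy tree is Add.
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        return Const(lhs.value + rhs.value)
    if rhs == Const(0):
        return lhs  # x + 0 -> x
    if lhs == Const(0):
        return rhs  # 0 + x -> x
    return Add(lhs, rhs)

def address(base, index, itemsize):
    # No `if itemsize == 1` branch here: the folder handles that case.
    return fold(Add(base, Mul(index, Const(itemsize))))
```

With an item size of 1 the multiply disappears from the folded expression, while larger item sizes keep it, which is the behavior the explicit check used to provide.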

improved caching for pointer arithmetics in ptx

With this and https://github.com/tinygrad/tinygrad/pull/3894 we are matching CUDA speed on `HALF=1 DEBUG=2 python3 extra/gemm/simple_matmul.py`

CUDA fails with 701 (too many resources requested for launch) when running some kernels from load_worlds

MRE:
```
from tinygrad import Device
from extra.optimization.helpers import load_worlds, ast_str_to_lin
from tinygrad.features.search import bufs_from_lin

if __name__ == "__main__":
  ast_strs = load_worlds(filter_reduce=False, filter_novariable=True)
  ast_strs = [x for x in ast_strs...
```

Cache friendly reduceop

Currently the reduce op starts with acc, continues with the loaded data, and ends with acc. It is much more cache friendly if we do the local data first and then end with acc,...

Linearize produces worse ordering for PTX than CUDA

Running this kernel produces a different ordering of LOAD/WMMA for PTX, giving worse performance: `[Opt(op=OptOps.TC, axis=0, amt=0), Opt(op=OptOps.LOCAL, axis=0, amt=4), Opt(op=OptOps.UPCAST, axis=0, amt=0), Opt(op=OptOps.UNROLL, axis=0, amt=4), Opt(op=OptOps.LOCAL, axis=0, amt=2)]`...

Szymon Ożóg

ptx cleanup

improved caching for pointer arithmetics in ptx

Int mulacc for ptx

CUDA fails with 701 (too many resources requested for launch) when running some kernels from load_worlds

Cache friendly reduceop

Linearize produces worse ordering for PTX than CUDA

Assertion error when trying to run with a transcription model

[Model] Deepseek GGUF support

Missing comment explaining VDR variable in GGUF kernels