Szymon Ożóg
- removes childless uops
- checking for an item size of 1 is no longer necessary in pointer arithmetic now that we have constant folding
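To illustrate the second point, here is a minimal sketch of why a constant folder makes an explicit `itemsize == 1` check redundant. It uses a toy expression IR, not tinygrad's uop representation: `Const`, `Var`, `Mul`, `Add`, `fold`, and `byte_offset` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Union

# Toy expression nodes (hypothetical, not tinygrad's uops).
@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Mul:
    lhs: "Expr"
    rhs: "Expr"

@dataclass(frozen=True)
class Add:
    lhs: "Expr"
    rhs: "Expr"

Expr = Union[Const, Var, Mul, Add]

def fold(e):
    """Recursively fold constants and drop multiply-by-1 / add-0 identities."""
    if isinstance(e, (Const, Var)):
        return e
    lhs, rhs = fold(e.lhs), fold(e.rhs)
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        if isinstance(e, Mul):
            return Const(lhs.value * rhs.value)
        return Const(lhs.value + rhs.value)
    if isinstance(e, Mul):
        if lhs == Const(1):
            return rhs
        if rhs == Const(1):
            return lhs
        if Const(0) in (lhs, rhs):
            return Const(0)
        return Mul(lhs, rhs)
    # Remaining case: Add with at least one non-constant operand.
    if lhs == Const(0):
        return rhs
    if rhs == Const(0):
        return lhs
    return Add(lhs, rhs)

def byte_offset(base, idx, itemsize):
    # No special case for itemsize == 1: the folder removes the multiply.
    return fold(Add(base, Mul(idx, Const(itemsize))))
```

With this sketch, `byte_offset(Var("buf"), Var("i"), 1)` folds to `Add(Var("buf"), Var("i"))`, so the code that emits the address arithmetic does not need its own branch for single-byte items.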
With this and https://github.com/tinygrad/tinygrad/pull/3894 we are matching CUDA speed on `HALF=1 DEBUG=2 python3 extra/gemm/simple_matmul.py`
MRE:
```
from tinygrad import Device
from extra.optimization.helpers import load_worlds, ast_str_to_lin
from tinygrad.features.search import bufs_from_lin

if __name__ == "__main__":
  ast_strs = load_worlds(filter_reduce=False, filter_novariable=True)
  ast_strs = [x for x in ast_strs...
```
Currently a reduce op starts with acc, continues with the loaded data, and ends with acc. It's much more cache friendly if we do local data first and then end with acc,...
Running this kernel produces a different ordering of LOAD/WMMA for PTX giving worse performance: ` [Opt(op=OptOps.TC, axis=0, amt=0), Opt(op=OptOps.LOCAL, axis=0, amt=4), Opt(op=OptOps.UPCAST, axis=0, amt=0), Opt(op=OptOps.UNROLL, axis=0, amt=4), Opt(op=OptOps.LOCAL, axis=0, amt=2)]`...
## Description of bug / unexpected behavior
After installing packages required to run a transcription model it throws an assertion error when trying to use it
## Expected behavior
The...
This adds support for quantized DeepSeek versions from Unsloth. Currently Hugging Face does not support DeepSeek, so I added an option to add an override path where we can read the...
Taken from https://github.com/ggerganov/llama.cpp/blob/3d68f034dad53f0f27ad626b2732ef48fbcea4ee/ggml/src/ggml-cuda/vecdotq.cuh#L18