chenyu
memory usage is between float and half due to numpy calls in dataset preprocessing, which convert the data into float. on 3090
```
997 138.80 ms run, 2.65 ms python, 136.15 ms...
```
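A minimal numpy sketch of why this happens (illustrative only, not the actual preprocessing code): numpy defaults to float64/float32, so data loaded through it silently costs 4-8x the memory of half until it is explicitly cast back down.

```python
import numpy as np

# np.array on Python floats defaults to float64; preprocessing through
# numpy therefore promotes half-precision data to full float
batch = np.array([0.1, 0.2, 0.3])
assert batch.dtype == np.float64

# keeping the data in half precision requires an explicit cast,
# which shrinks memory usage by 4x relative to float64
half_batch = batch.astype(np.float16)
assert half_batch.nbytes == batch.nbytes // 4
```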
looking for standard names for these. we have FLOAT16, which does something to IMAGE, and HALF, which converts weights.
it runs if the GPU device supports bfloat16. updated the CI benchmark too
the output type of `sqrt, sin, exp2, log2` can be different from the input type. This breaks codegen assumptions and can result in imprecise results or compiler errors. in clang,...
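A small sketch of the general phenomenon (in Python/numpy rather than the clang backend): math functions often return a wider type than their input, which is exactly what codegen must account for.

```python
import math
import numpy as np

# Python's math.sqrt returns a float even for int input, much like C's
# double-valued sqrt() widening a narrower argument
r = math.sqrt(4)
assert type(r) is float

# numpy keeps the input dtype when a matching implementation exists...
half_out = np.sqrt(np.float16(2.0))
assert half_out.dtype == np.float16

# ...but promotes integer inputs to a wider float type
int_out = np.sqrt(np.array([4], dtype=np.int32))
assert int_out.dtype == np.float64
```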
70B llama uses almost all VRAM, and BEAM search's new buffer allocation would fail due to running out of resources. It would be nice if we could search live with some...
testing perf; also, this might have an issue with assign?
Appending a fake global is incorrect because the unused buffer can be in the middle. And the buffer map information is not available to the caller after compilation based on the lib alone....
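A hypothetical sketch of why the middle case breaks (not tinygrad's actual code; names like `used` and `signature` are made up for illustration): if codegen only emits the globals a kernel actually uses, positional binding by the caller misaligns whenever the dropped buffer is not the last one, so appending a fake trailing global cannot fix it.

```python
# kernel references buffers 0 and 2 but not 1
used = [0, 2]
signature = [f"data{i}" for i in used]   # what the compiled lib exposes
call_args = ["buf0", "buf1", "buf2"]     # what the caller passes positionally

# positional binding pairs buf1 with data2 -- the wrong buffer.
# a fake global appended at the end would only absorb a trailing
# unused buffer, never one in the middle.
bound = dict(zip(signature, call_args))
assert bound == {"data0": "buf0", "data2": "buf1"}
```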
```
from tinygrad import Tensor, dtypes
Tensor([1, 2, 3], dtype=dtypes.half).mean().realize()
```
generates
```
void r_3(half* restrict data0, const half* restrict data1) {
  float acc0 = 0.0f;
  for (int ridx0 =...
```
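The `float acc0` in the generated kernel matters. A minimal numpy sketch of why accumulating a half-precision reduction in half would lose precision, while a float accumulator stays exact:

```python
import numpy as np

x = np.ones(2049, dtype=np.float16)

# accumulate in float16: once the running sum reaches 2048, adding 1.0
# is lost, because float16 spacing at 2048 is 2.0 and ties round to even
half_sum = np.float16(0)
for v in x:
    half_sum = np.float16(half_sum + v)
assert half_sum == 2048.0

# accumulate in float32, like the kernel's `float acc0`: exact
float_sum = x.astype(np.float32).sum()
assert float_sum == 2049.0
```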
probably target dependent, `__fp16` is for ARM only? but CI passes too. example, with `CLANG=1`
```
a = Tensor([1, 2, 3], dtype=dtypes.double)
print(a.numpy())
print(a.cast(dtypes.half).numpy())
```
source
```
#include
#include
#define...
```
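For reference, a numpy sketch of what a correct double-to-half cast should do (this is the expected IEEE 754 behavior, not the backend under test): values round to ~3 decimal digits, overflow to inf past ~65504, and flush to zero below the smallest subnormal.

```python
import numpy as np

a = np.array([1 / 3, 1e5, 1e-8], dtype=np.float64)
h = a.astype(np.float16)

# float16 keeps roughly 3 decimal digits of precision
assert abs(float(h[0]) - 1 / 3) < 1e-3
# 1e5 exceeds float16's max finite value (~65504) and becomes inf
assert np.isinf(h[1])
# 1e-8 is below the smallest float16 subnormal (~6e-8) and becomes 0
assert h[2] == 0.0
```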
benchmark on 3090/4090
- [ ] on hlb_cifar bf16 is 10% slower than half
- [ ] resnet training in bf16 produces nan errors
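A rough sketch of how bf16 differs from half, since that trade-off drives both items above: bfloat16 keeps float32's 8-bit exponent (so big values survive where half would overflow) but only 7 mantissa bits. The helpers below are hypothetical, using the simplest truncating conversion; real hardware usually rounds to nearest-even.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # keep the top 16 bits of the float32 encoding (truncation)
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    # pad the low 16 mantissa bits with zeros to recover a float32
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# bf16 shares float32's exponent range, so 1e38 stays finite
# (float16 would overflow to inf past ~65504)
big = bf16_bits_to_f32(f32_to_bf16_bits(1e38))
assert big != float("inf")

# but only 7 mantissa bits remain, so small relative differences vanish
assert bf16_bits_to_f32(f32_to_bf16_bits(1.001)) == 1.0
```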