chenyu
memory usage is between float and half due to numpy calls in dataset preprocessing, which convert the data into float. on 3090
```
997 138.80 ms run, 2.65 ms python, 136.15 ms...
```
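A minimal numpy sketch of why this happens (illustrative only, not the actual preprocessing code): numpy defaults to float64/float32, so data loaded through it silently costs 4-8x the memory of half until it is explicitly cast back down.

```python
import numpy as np

# np.array on Python floats defaults to float64; preprocessing through
# numpy therefore promotes half-precision data to full float
batch = np.array([0.1, 0.2, 0.3])
assert batch.dtype == np.float64

# keeping the data in half precision requires an explicit cast,
# which shrinks memory usage by 4x relative to float64
half_batch = batch.astype(np.float16)
assert half_batch.nbytes == batch.nbytes // 4
```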
looking for standard names for these. we have FLOAT16, which does something to IMAGE, and HALF, which converts weights.
it runs if the GPU device supports bfloat16. updated the CI benchmark too
the output type of `sqrt, sin, exp2, log2` can be different from the input type. This breaks codegen assumptions and can result in imprecise results or compiler errors. in clang,...
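A small sketch of the general phenomenon (in Python/numpy rather than the clang backend): math functions often return a wider type than their input, which is exactly what codegen must account for.

```python
import math
import numpy as np

# Python's math.sqrt returns a float even for int input, much like C's
# double-valued sqrt() widening a narrower argument
r = math.sqrt(4)
assert type(r) is float

# numpy keeps the input dtype when a matching implementation exists...
half_out = np.sqrt(np.float16(2.0))
assert half_out.dtype == np.float16

# ...but promotes integer inputs to a wider float type
int_out = np.sqrt(np.array([4], dtype=np.int32))
assert int_out.dtype == np.float64
```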
70B llama uses almost all VRAM, and BEAM search's new buffer allocation would fail due to running out of resources. It would be nice if we could search live with some...
testing perf; also, this might have an issue with assign?
Appending a fake global is incorrect because the unused buffer can be in the middle. And the buffer map information is not available to the caller after compilation based on the lib alone....
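A hypothetical sketch of why the middle case breaks (not tinygrad's actual code; names like `used` and `signature` are made up for illustration): if codegen only emits the globals a kernel actually uses, positional binding by the caller misaligns whenever the dropped buffer is not the last one, so appending a fake trailing global cannot fix it.

```python
# kernel references buffers 0 and 2 but not 1
used = [0, 2]
signature = [f"data{i}" for i in used]   # what the compiled lib exposes
call_args = ["buf0", "buf1", "buf2"]     # what the caller passes positionally

# positional binding pairs buf1 with data2 -- the wrong buffer.
# a fake global appended at the end would only absorb a trailing
# unused buffer, never one in the middle.
bound = dict(zip(signature, call_args))
assert bound == {"data0": "buf0", "data2": "buf1"}
```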
```
from tinygrad import Tensor, dtypes
Tensor([1, 2, 3], dtype=dtypes.half).mean().realize()
```
generates
```
void r_3(half* restrict data0, const half* restrict data1) {
  float acc0 = 0.0f;
  for (int ridx0 =...
```
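The `float acc0` in the generated kernel matters. A minimal numpy sketch of why accumulating a half-precision reduction in half would lose precision, while a float accumulator stays exact:

```python
import numpy as np

x = np.ones(2049, dtype=np.float16)

# accumulate in float16: once the running sum reaches 2048, adding 1.0
# is lost, because float16 spacing at 2048 is 2.0 and ties round to even
half_sum = np.float16(0)
for v in x:
    half_sum = np.float16(half_sum + v)
assert half_sum == 2048.0

# accumulate in float32, like the kernel's `float acc0`: exact
float_sum = x.astype(np.float32).sum()
assert float_sum == 2049.0
```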
probably target dependent, `__fp16` is for ARM only? but CI passes too. example, with `CLANG=1`
```
a = Tensor([1, 2, 3], dtype=dtypes.double)
print(a.numpy())
print(a.cast(dtypes.half).numpy())
```
source
```
#include
#include
#define...
```
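For reference, a numpy sketch of what a correct double-to-half cast should do (this is the expected IEEE 754 behavior, not the backend under test): values round to ~3 decimal digits, overflow to inf past ~65504, and flush to zero below the smallest subnormal.

```python
import numpy as np

a = np.array([1 / 3, 1e5, 1e-8], dtype=np.float64)
h = a.astype(np.float16)

# float16 keeps roughly 3 decimal digits of precision
assert abs(float(h[0]) - 1 / 3) < 1e-3
# 1e5 exceeds float16's max finite value (~65504) and becomes inf
assert np.isinf(h[1])
# 1e-8 is below the smallest float16 subnormal (~6e-8) and becomes 0
assert h[2] == 0.0
```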
benchmark on 3090/4090
- [ ] on hlb_cifar bf16 is 10% slower than half
- [ ] resnet training in bf16 produces nan errors
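A rough sketch of how bf16 differs from half, since that trade-off drives both items above: bfloat16 keeps float32's 8-bit exponent (so big values survive where half would overflow) but only 7 mantissa bits. The helpers below are hypothetical, using the simplest truncating conversion; real hardware usually rounds to nearest-even.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # keep the top 16 bits of the float32 encoding (truncation)
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    # pad the low 16 mantissa bits with zeros to recover a float32
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# bf16 shares float32's exponent range, so 1e38 stays finite
# (float16 would overflow to inf past ~65504)
big = bf16_bits_to_f32(f32_to_bf16_bits(1e38))
assert big != float("inf")

# but only 7 mantissa bits remain, so small relative differences vanish
assert bf16_bits_to_f32(f32_to_bf16_bits(1.001)) == 1.0
```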