chenyu
tinybox HSA supports bf16 buffers, so we can use the bfloat16 weights directly and don't need https://github.com/tinygrad/tinygrad/blob/639bd5dbfcef4e329b6fbba2de571c8cf70ee95b/examples/llama.py#L179-L180. Commenting that out, `python3 examples/llama.py --gen 2 --size 7B --shard 6 --prompt "Hello." --count...
support changing default_float to bfloat16 and train cifar with bf16
- [x] bf16 Tensor creation and numpy #3724 #3747
- [x] bf16 rand #3764
- [x] move bf16 cast hack...
training submission deadline is 5/10: https://docs.google.com/spreadsheets/d/1oeWuZ2GHb0r2d_v5gqUWVXhXVdTBTm0Qn_svyWYZPr4/edit#gid=538466233
submission rules: https://github.com/mlcommons/policies/blob/master/submission_rules.adoc#42-single-submission-round-schedule
We currently drop the last incomplete batch. MLPerf eval needs to use every sample exactly once.
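One common way to keep a fixed batch size while still visiting every sample exactly once is to pad the final batch and carry a validity mask. The sketch below is pure Python and illustrative only: `eval_batches` and `pad_value` are hypothetical names, not part of tinygrad or the MLPerf reference code.

```python
from typing import Iterator, List, Sequence, Tuple

def eval_batches(samples: Sequence[int], batch_size: int,
                 pad_value: int = 0) -> Iterator[Tuple[List[int], List[bool]]]:
    """Yield fixed-size batches with a mask; the last batch is padded, not dropped.

    Hypothetical helper: mask[i] is True for real samples and False for padding,
    so metrics can ignore the padded slots.
    """
    for start in range(0, len(samples), batch_size):
        chunk = list(samples[start:start + batch_size])
        n_real = len(chunk)
        n_pad = batch_size - n_real
        yield chunk + [pad_value] * n_pad, [True] * n_real + [False] * n_pad

# 10 samples with batch size 4 gives 3 batches; the last has 2 real samples.
batches = list(eval_batches(list(range(10)), batch_size=4))
seen = [x for batch, mask in batches for x, ok in zip(batch, mask) if ok]
```

Filtering by the mask recovers each sample once, so accuracy can be accumulated over masked positions only.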
argmax is implemented as two steps: finding the max, then finding the index of that max. The max-finding step was fused with softmax, which introduced numerical error. The resulting max can differ from every element of the tensor,...
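A pure-Python sketch of the two-step decomposition shows why an inexact max matters. This is an illustration under assumptions, not tinygrad's code: `argmax_two_step` is a hypothetical helper, and the `max_value` parameter exists only to inject a perturbed max by hand.

```python
from typing import List, Optional

def argmax_two_step(xs: List[float], max_value: Optional[float] = None) -> int:
    """Argmax as 'find max' followed by 'find index equal to max'.

    max_value lets the caller supply a precomputed (possibly inexact) max,
    standing in for a max that picked up numerical error elsewhere.
    """
    n = len(xs)
    m = max(xs) if max_value is None else max_value
    # Score each matching position by (n - i) so the first match has the
    # largest score; positions that do not equal m score 0.
    best = max((n - i) if x == m else 0 for i, x in enumerate(xs))
    return n - best

xs = [0.1, 0.7, 0.7, 0.2]
exact = argmax_two_step(xs)                         # max matches index 1
perturbed = argmax_two_step(xs, max_value=0.7000001)  # max matches nothing
```

In this sketch, when the supplied max equals no element, every score is 0 and the function returns `len(xs)`, an out-of-range index. The exact equality test in step two is what makes the decomposition sensitive to any error in step one.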
tracking issue. the goal is to cleanly separate sint, used for symbolic shape, from the symbolic template used to generate kernel indices.
- sint does not need to include all...
check perf impact to see if it justifies the added complexity to support pointer arg
spec: https://data-apis.org/array-api/2022.12/API_specification/index.html tests: https://github.com/data-apis/array-api-tests
open for investigation. `BEAM=2 python examples/handcode_resnet50_opt.py` on METAL M1 Max
1. master: 297ms
2. removed all TC: 206ms
3. keep only TC axis=0: 186ms
except casting a negative float to an unsigned integer, which is undefined