intervitens
Added splitK calculation. This speeds up a GSM8k benchmark run on L3-8B with bs=32 on a single 3090 Ti from 8:47 to 7:57, and raises bs=1 throughput from 68 t/s to 93 t/s
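For context, the idea behind a split-K heuristic is roughly the following sketch (the `pick_split_k` name, tile sizes, and SM count below are illustrative assumptions, not the actual kernel code): when there are few output tiles, as in bs=1 decoding, splitting the K dimension of the matmul across extra blocks keeps more SMs busy, which is why the single-batch case benefits most.

```python
# Hypothetical sketch only: choose a split-K factor so that
# (output tiles) x (K-splits) roughly fills the GPU's SMs.
def pick_split_k(m, n, k, num_sms=84, tile_m=16, tile_n=64,
                 min_k_per_split=512, max_splits=8):
    tiles_m = (m + tile_m - 1) // tile_m
    tiles_n = (n + tile_n - 1) // tile_n
    output_tiles = tiles_m * tiles_n

    splits = 1
    # At small m (e.g. bs=1 decoding) there are few output tiles, so split the
    # K dimension across more blocks until the grid roughly fills the SMs,
    # as long as each split still gets a reasonably large chunk of K.
    while (output_tiles * splits < num_sms
           and splits < max_splits
           and k // (splits * 2) >= min_k_per_split):
        splits *= 2
    return splits
```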
I think it makes more sense to reshape `hidden_states` and `targets` to `[batch_size*seq_len]` inside the loss forward and to set `reduction` to `mean` instead of `sum` in order to match...
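A minimal sketch of what I mean (the `LinearCELoss` module and its exact shapes are just for illustration, not the existing implementation): flatten the batch and sequence dimensions inside the loss's `forward` and let `CrossEntropyLoss` average over the valid tokens.

```python
import torch.nn as nn

class LinearCELoss(nn.Module):
    # Illustrative module: project hidden states to the vocab and compute CE,
    # flattening the batch and sequence dims inside forward.
    def __init__(self, hidden_dim, vocab_size, ignore_index=-100):
        super().__init__()
        self.output = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.ce = nn.CrossEntropyLoss(reduction="mean", ignore_index=ignore_index)

    def forward(self, hidden_states, targets):
        # hidden_states: [batch_size, seq_len, hidden_dim], targets: [batch_size, seq_len]
        hidden_states = hidden_states.reshape(-1, hidden_states.size(-1))  # [batch_size*seq_len, hidden_dim]
        targets = targets.reshape(-1)                                      # [batch_size*seq_len]
        logits = self.output(hidden_states)
        return self.ce(logits.float(), targets)
```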
On further testing, it seems like there are still issues:
- When `fused: false`, using an LR scheduler + compiled optimizer results in `nan` loss after the first step
- `optimizer_in_bwd` not...
Looks like the issue is caused by this PyTorch bug: https://github.com/pytorch/pytorch/issues/126514. It goes away when I modify `get_cosine_schedule_with_warmup` so that the minimum learning rate is a very small number...
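The workaround I'm referring to is roughly this (a sketch based on the usual cosine-with-warmup formula; the exact signature may differ, and the `min_factor` floor value is arbitrary): clamp the LR multiplier so the schedule never returns exactly zero.

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps, num_training_steps,
    num_cycles=0.5, last_epoch=-1, min_factor=1e-8,
):
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return max(min_factor, current_step / max(1, num_warmup_steps))
        progress = (current_step - num_warmup_steps) / max(
            1, num_training_steps - num_warmup_steps
        )
        cosine = 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress))
        # Floor the multiplier so the LR never becomes exactly 0, which is
        # what appears to trigger the PyTorch issue above.
        return max(min_factor, cosine)

    return LambdaLR(optimizer, lr_lambda, last_epoch)
```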
This is the config I can use to reproduce the issue: https://gist.github.com/intervitens/df9eef7fd3ff3eec979b6aa6214ea99c. Running it with `tune run --nproc_per_node 4 full_finetune_distributed --config config_llama_1B.yaml` results in the loss becoming `nan` after...
Note that this script compares the models in float32 precision on CPU. When I modify the script to compare the models on GPU in bf16 precision, I suddenly see a...
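The comparison is along these lines (a simplified sketch; the `max_abs_diff` helper and output handling are hypothetical, not the actual script):

```python
import torch

@torch.no_grad()
def max_abs_diff(model_a, model_b, sample_input, device="cpu", dtype=torch.float32):
    # Run both models on the same input at the given device/precision and
    # report the largest element-wise difference between their outputs.
    model_a = model_a.to(device=device, dtype=dtype).eval()
    model_b = model_b.to(device=device, dtype=dtype).eval()
    out_a = model_a(sample_input.to(device))
    out_b = model_b(sample_input.to(device))
    return (out_a.float() - out_b.float()).abs().max().item()

# float32 on CPU vs. bf16 on GPU:
# max_abs_diff(m1, m2, x)
# max_abs_diff(m1, m2, x, device="cuda", dtype=torch.bfloat16)
```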