torchtitan
Grad scaler not in train state
The grad scaler's scale factor needs to be saved in the train state so that it is restored correctly when training resumes from a checkpoint.
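A rough sketch of what this could look like, not torchtitan's actual code. It assumes only that the scaler exposes `state_dict()` and `load_state_dict()`, as `torch.cuda.amp.GradScaler` does. The names `TrainStateWithScaler` and `StandInScaler` are hypothetical, and the stand-in exists only so the snippet runs without PyTorch or a GPU.

```python
from dataclasses import dataclass
from typing import Any, Dict


class StandInScaler:
    """Hypothetical stand-in mimicking the state_dict()/load_state_dict()
    interface of torch.cuda.amp.GradScaler, so the sketch runs anywhere."""

    def __init__(self, scale: float = 65536.0) -> None:
        self.scale = scale

    def state_dict(self) -> Dict[str, Any]:
        return {"scale": self.scale}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.scale = state["scale"]


@dataclass
class TrainStateWithScaler:
    """Hypothetical train state that checkpoints the scaler next to the step."""

    scaler: Any
    step: int = 0

    def state_dict(self) -> Dict[str, Any]:
        # Nest the scaler's own state so the scale factor is part of the checkpoint.
        return {"step": self.step, "scaler": self.scaler.state_dict()}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.step = state["step"]
        self.scaler.load_state_dict(state["scaler"])


# Simulate a run whose scale factor has backed off, then resume from a checkpoint.
saved = TrainStateWithScaler(scaler=StandInScaler(scale=1024.0), step=500)
checkpoint = saved.state_dict()

resumed = TrainStateWithScaler(scaler=StandInScaler())  # fresh default scale
resumed.load_state_dict(checkpoint)
```

Without the nested `"scaler"` entry, the resumed run would start again from the scaler's default scale rather than the value it had reached before the checkpoint.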
@BadrYoubiIdrissi Curious which cases you are using fp16 training for (if you can share)?
Hey! Really sorry for forgetting about this issue! We were testing some things on the (now aging) fair cluster (H2), which has V100s that don't support bf16, so we thought we'd give fp16 a try, but unfortunately it's quite unstable... Maybe the grad scaler needs to be tuned a bit to make it more stable.
Closing, as we don't plan to support fp16 and have removed the grad scaler from the code.