torchtitan Grad scaler not in train state

Open BadrYoubiIdrissi opened this issue 11 months ago • 2 comments

The grad scaler's scale factor needs to be saved in the train state so that it is restored correctly on reload.
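For illustration, here is a minimal sketch of what saving the scaler alongside the rest of the train state could look like. It is not torchtitan's actual code: `TrainStateWithScaler`, `Stateful`, and `ToyScaler` are hypothetical names made up for this example. The sketch only assumes that the scaler exposes `state_dict()` / `load_state_dict()`, as `torch.cuda.amp.GradScaler` does; `ToyScaler` stands in for it so the snippet runs without PyTorch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Protocol


class Stateful(Protocol):
    """Anything exposing state_dict() / load_state_dict()."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None: ...


@dataclass
class TrainStateWithScaler:
    """Hypothetical train state that checkpoints the grad scaler too."""

    scaler: Stateful
    step: int = 0

    def state_dict(self) -> Dict[str, Any]:
        # Nest the scaler's own state so the scale factor is saved with the step.
        return {"step": self.step, "scaler": self.scaler.state_dict()}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.step = state_dict["step"]
        # Delegate restoring the scale factor to the scaler itself.
        self.scaler.load_state_dict(state_dict["scaler"])


class ToyScaler:
    """Stand-in for a real grad scaler (assumption: same two methods)."""

    def __init__(self, scale: float = 65536.0) -> None:
        self.scale = scale

    def state_dict(self) -> Dict[str, Any]:
        return {"scale": self.scale}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.scale = state_dict["scale"]


# Save at step 500 with a scale that has already backed off to 1024.
saved = TrainStateWithScaler(ToyScaler(scale=1024.0), step=500).state_dict()

# Resume: a fresh scaler starts at its default scale until the state is loaded.
resumed = TrainStateWithScaler(ToyScaler())
resumed.load_state_dict(saved)
```

Without the nested `"scaler"` entry, the resumed run would restart from the scaler's default scale rather than the value it had reached before the checkpoint.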

Mar 14 '24 02:03 BadrYoubiIdrissi

@BadrYoubiIdrissi Curious which cases you are using fp16 training for (if you can share)?

Mar 14 '24 22:03 awgu

Hey! Really sorry for forgetting about this issue! We were testing some things on the (now aging) FAIR cluster (H2), which has V100s that don't support bf16, so we thought we'd give fp16 a try, but it's quite unstable, unfortunately... Maybe the grad scaler needs to be tuned a bit to make it more stable.

Mar 26 '24 22:03 BadrYoubiIdrissi

Closing, as we don't plan to support fp16 and have removed the grad scaler from the code.

May 03 '24 01:05 tianyu-l
