
Grad scaler not in train state

Open BadrYoubiIdrissi opened this issue 11 months ago • 2 comments

Grad scaler factor needs to be saved in train state for proper reloading.
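To illustrate why this matters, here is a minimal sketch in plain Python. `ToyGradScaler` and `ToyTrainState` are hypothetical stand-ins, not torchtitan's or PyTorch's code; the real `torch.cuda.amp.GradScaler` exposes `state_dict()` and `load_state_dict()` in a similar way. The sketch assumes the scaler halves its scale on overflow and that the train state nests the scaler's state under a `"scaler"` key.

```python
# Hypothetical stand-in for a dynamic loss scaler (not the PyTorch implementation).
class ToyGradScaler:
    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.growth_tracker = 0  # consecutive steps without overflow

    def update(self, found_inf):
        # Shrink the scale on overflow, grow it after a run of clean steps.
        if found_inf:
            self.scale *= self.backoff_factor
            self.growth_tracker = 0
        else:
            self.growth_tracker += 1
            if self.growth_tracker == self.growth_interval:
                self.scale *= self.growth_factor
                self.growth_tracker = 0

    def state_dict(self):
        return {"scale": self.scale, "growth_tracker": self.growth_tracker}

    def load_state_dict(self, state):
        self.scale = state["scale"]
        self.growth_tracker = state["growth_tracker"]


# Hypothetical train state that carries the scaler alongside the step count.
class ToyTrainState:
    def __init__(self):
        self.step = 0
        self.scaler = ToyGradScaler()

    def state_dict(self):
        return {"step": self.step, "scaler": self.scaler.state_dict()}

    def load_state_dict(self, state):
        self.step = state["step"]
        self.scaler.load_state_dict(state["scaler"])


# Simulate three steps that all overflow, then checkpoint.
running = ToyTrainState()
for _ in range(3):
    running.scaler.update(found_inf=True)
    running.step += 1
checkpoint = running.state_dict()

# Resuming with the scaler state restores the adapted scale.
restored = ToyTrainState()
restored.load_state_dict(checkpoint)

# Resuming without it falls back to the initial scale.
fresh = ToyTrainState()

print(restored.scaler.scale, fresh.scaler.scale)
```

Without the scaler entry in the checkpoint, a resumed run would start again from the initial scale and have to rediscover a workable value through further overflowed (skipped) steps.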

BadrYoubiIdrissi avatar Mar 14 '24 02:03 BadrYoubiIdrissi

@BadrYoubiIdrissi Curious which cases you are using fp16 training for (if you can share)?

awgu avatar Mar 14 '24 22:03 awgu

Hey! Really sorry for forgetting about this issue! We were testing some things on the (now aging) FAIR cluster (H2), which has V100s that don't support bf16, so we thought we'd give fp16 a try, but it's quite unstable unfortunately... Maybe the grad scaler needs to be tuned a bit to make it more stable.

BadrYoubiIdrissi avatar Mar 26 '24 22:03 BadrYoubiIdrissi

Closing, as we don't plan to support fp16 and have removed the grad scaler from the code.

tianyu-l avatar May 03 '24 01:05 tianyu-l