
[Feature] Expose the Torch NaN checker as a configurable option in toml for those training at scale

Open lessw2020 opened this issue 9 months ago • 1 comments

I used Ke's NaN checker (landed in torch a few months ago) this weekend to finally pin down an extremely painful errant GPU that had been causing NaNs throughout the weekend's training. Most users don't even know this exists. So the idea here is to add it as an easily available option in toml and document what it does.
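For illustration only, here is a minimal sketch of what such an option could look like. It assumes the checker in question is the NCCL process-group NaN check that PyTorch toggles through the `TORCH_NCCL_NAN_CHECK` environment variable; the `[comm]` section, the `enable_nan_check` key, and the helper name are hypothetical and not existing torchtitan options.

```python
import os
from typing import MutableMapping

# Hypothetical toml fragment this sketch would correspond to:
#   [comm]
#   enable_nan_check = true


def apply_nan_check_option(
    job_config: dict, env: MutableMapping[str, str] = os.environ
) -> bool:
    """Export the NaN-check env var when the (hypothetical) option is set.

    Assumption: the checker is controlled by TORCH_NCCL_NAN_CHECK, so this
    has to run before the NCCL process group is initialized.
    """
    enabled = bool(job_config.get("comm", {}).get("enable_nan_check", False))
    if enabled:
        env["TORCH_NCCL_NAN_CHECK"] = "1"
    return enabled
```

Keeping the option off by default would leave existing runs unchanged, and the helper takes the parsed config as a plain dict so it does not depend on any particular toml loader.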

lessw2020, Mar 03 '25 07:03

A related ask: https://github.com/pytorch/torchtitan/issues/916. Should we enable the checker by default and raise an assertion if a NaN occurs?

fegin, Mar 04 '25 06:03