torchtitan
[Feature] Expose the Torch NaN checker as a configurable option in the toml for those training at scale
Used Ke's NaN checker (landed in torch a few months ago) this weekend to finally pin down an extremely painful errant GPU that was causing NaNs throughout the weekend's training. Most users don't even know this exists. So the idea here is to add it as an easily available option in the toml and document what it does.
A related ask: https://github.com/pytorch/torchtitan/issues/916. Should we enable the checker by default and raise an assertion if NaNs occur?
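As a rough sketch of the fail-fast behavior being asked about, a trainer could check the scalar loss each step and abort on a bad value. The helper name `assert_finite_loss` is hypothetical, and this is a plain Python check on the loss value, not the Torch NaN checker itself; note that it also rejects infinities, not only NaN.

```python
import math


def assert_finite_loss(loss, step):
    """Raise as soon as the loss is NaN or infinite (hypothetical helper)."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
```

In a training loop this would be called with something like `loss.item()`, which forces a device sync, so whether to enable it by default is a trade-off between early detection and throughput.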