Adding config options for deterministic execution
Knobs: torch.use_deterministic_algorithms() with warn_only, plus the activation-checkpoint preserve_rng_state, determinism_check, and debug options.
See: https://github.com/pytorch/torchtitan/issues/1736
@tianyu-l, I understand your concern about inflating the number of configs. In brief, these are knobs I needed in order to debug issues related to deterministic compute and activation-checkpoint recompute discrepancies. Determinism, full or partial, is very important for debugging numerical issues tied to randomness, as you know. Activation checkpointing is an important technique for reducing memory pressure, so getting details on where exactly a checkpointed recompute diverges matters. These knobs would make TorchTitan debugging significantly friendlier and faster when bringing up new models and accelerators. Otherwise, every end-user developer of TorchTitan would have to hunt for where and how to add these debugging hooks.
I can add the above as proposal text (and some more details) to the issue mentioned above, if that works.
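For reference, here is a minimal standalone sketch of what these knobs toggle in plain PyTorch (not torchtitan code; the tiny `forward_block` is made up for illustration, but the torch calls and kwargs are the actual APIs involved):

```python
import os

import torch
from torch.utils.checkpoint import checkpoint

# Deterministic kernels; cuBLAS needs this env var before the first CUDA call.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
# warn_only=True logs a warning instead of raising on nondeterministic ops.
torch.use_deterministic_algorithms(True, warn_only=True)


def forward_block(x):
    # Dropout makes recompute sensitive to RNG state, which is exactly
    # what preserve_rng_state guards against.
    return torch.nn.functional.dropout(x, p=0.1)


x = torch.randn(8, 8, requires_grad=True)
out = checkpoint(
    forward_block,
    x,
    use_reentrant=False,          # required for determinism_check/debug
    preserve_rng_state=True,      # replay the same RNG state on recompute
    determinism_check="default",  # compare recomputed tensor metadata
    debug=True,                   # dump op traces if recompute diverges
)
out.sum().backward()
```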
OK, fine, I guess it's not that hard to convince me.
Instead of scattering debugging configs around, I wonder if we can put all debug options into one config called Debug under JobConfig, and clearly document in the helper messages (1) what each config is for and (2) pointers to resources where people can read more.
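E.g., roughly something like the following (just a hypothetical sketch, with field names borrowed from the drafts later in this thread; the docstrings would double as the helper messages):

```python
from dataclasses import dataclass, field


@dataclass
class Debug:
    deterministic: bool = False
    """Use torch.use_deterministic_algorithms(); see
    https://pytorch.org/docs/stable/notes/randomness.html"""

    deterministic_warn_only: bool = False
    """Only warn on nondeterministic ops instead of erroring."""

    preserve_rng_state: bool = False
    """Save/restore RNG state around activation-checkpoint recompute."""

    determinism_check: str = "default"
    """Forwarded to torch.utils.checkpoint ("default" or "none")."""

    ac_debug: bool = False
    """Forward debug=True to torch.utils.checkpoint for recompute traces."""


@dataclass
class JobConfig:
    # ... existing sections elided ...
    debug: Debug = field(default_factory=Debug)
```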
Also, I would appreciate it if you could share your understanding of AC for DSv3 training.
@tianyu-l, sounds like you want to pull all the debug configs into a separate section in the toml file. Would it be something like the following? That would require more changes to the code, as some function calls only receive part of the configs!
[troubleshoot]
deterministic = false
deterministic_warn_only = false
preserve_rng_state = false
determinism_check = "default"
ac_debug = false
I have not looked into DSv3 yet. Will try it out.
Yes. I think it'd be good to put all debug-related configs together, including the random seed and the option added in https://github.com/pytorch/torchtitan/pull/1670. It will be a bigger refactor indeed.
@fegin, in case you have a preference.
@fegin, what are your thoughts on this?
@tianyu-l, does the following debug(?) section look OK? Whichever PR gets merged last (this one or that one) can expand the debug section.
[debug]
torch_deterministic = false
torch_deterministic_warn_only = false
torch_preserve_rng_state = false
ac_determinism_check = "default"
ac_debug = false
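For illustration, one possible way this section could be wired up at runtime (`apply_debug_config` is a hypothetical helper, not existing torchtitan code; only the torch calls are real API):

```python
import os
from functools import partial

import torch
from torch.utils.checkpoint import checkpoint


def apply_debug_config(dbg):
    """Hypothetical helper consuming the [debug] section above."""
    if dbg.torch_deterministic:
        # cuBLAS needs this env var before the first CUDA call for determinism.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        torch.use_deterministic_algorithms(
            True, warn_only=dbg.torch_deterministic_warn_only
        )
    # Return an AC wrapper with the debug knobs bound, so every call site
    # that applies activation checkpointing picks them up consistently.
    return partial(
        checkpoint,
        use_reentrant=False,
        preserve_rng_state=dbg.torch_preserve_rng_state,
        determinism_check=dbg.ac_determinism_check,
        debug=dbg.ac_debug,
    )
```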