
Adding config options for deterministic execution

Open githubsgi opened this issue 2 months ago • 6 comments

use_deterministic_algorithms(), warn_only, ac preserve_rng_state, ac debug

refer: https://github.com/pytorch/torchtitan/issues/1736

githubsgi avatar Sep 26 '25 03:09 githubsgi

@tianyu-l, I understand your concern about inflating the number of configs. In brief, these are some knobs I needed in order to debug issues related to deterministic compute and activation checkpoint recompute discrepancies. Determinism, full or partial, is very important for debugging numerical issues associated with randomness, as you know. Activation checkpointing is an important technique for reducing memory pressure, so getting details on exactly where activation checkpointing is failing is important. These knobs would make TorchTitan debugging significantly friendlier and quicker for new models and accelerators. Otherwise, every end-user developer of TorchTitan would need to hunt for where and how to add these debugging hooks.

I can add the above as proposal text (and some more details) to the issue mentioned above, if that works.

githubsgi avatar Sep 26 '25 17:09 githubsgi

OK, fine, I guess it's not that hard to convince me.

Instead of scattering debugging configs around, I wonder if we can put all debug options into one config called Debug under JobConfig, and clearly document in helper messages (1) what each config is for and (2) pointers to resources where people can read more.
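
A minimal sketch of what such a grouped section could look like, assuming a plain dataclass with per-field help text. The field names, the `metadata["help"]` convention, and the `describe` helper are all hypothetical illustrations, not TorchTitan's actual config mechanism.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Debug:
    """Hypothetical container for the debugging knobs discussed in this thread."""

    deterministic: bool = field(
        default=False,
        metadata={"help": "Enable torch.use_deterministic_algorithms(). "
                          "Read more: PyTorch reproducibility notes."},
    )
    deterministic_warn_only: bool = field(
        default=False,
        metadata={"help": "Warn instead of raising when an op has no "
                          "deterministic implementation."},
    )
    preserve_rng_state: bool = field(
        default=False,
        metadata={"help": "Stash and restore RNG state around activation "
                          "checkpoint recompute."},
    )
    ac_determinism_check: str = field(
        default="default",
        metadata={"help": "How checkpoint compares recomputed tensors with "
                          "the original forward pass."},
    )
    ac_debug: bool = field(
        default=False,
        metadata={"help": "Print extra detail when activation checkpoint "
                          "recompute does not match."},
    )


@dataclass
class JobConfig:
    """Stand-in for the real JobConfig; only the debug section is modeled."""

    debug: Debug = field(default_factory=Debug)


def describe(section) -> str:
    """Render one help line per option: name, default, and purpose."""
    return "\n".join(
        f"{f.name} (default={f.default!r}): {f.metadata['help']}"
        for f in fields(section)
    )


if __name__ == "__main__":
    print(describe(JobConfig().debug))
```

Keeping the help text next to each field means the "what is this for" and "where to read more" pointers travel with the option itself.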

Also would appreciate if you could share your understanding on AC for DSv3 training.

tianyu-l avatar Sep 26 '25 22:09 tianyu-l

@tianyu-l, it sounds like you want to pull all the debug configs into a separate section of the toml file. Would it be something like the following? That would require more code changes, as some function calls only pass part of the config.

[troubleshoot]
deterministic = false
deterministic_warn_only = false
preserve_rng_state = false
determinism_check = "default"
ac_debug = false

I have not looked into DSv3 yet. I will try it out.

githubsgi avatar Sep 29 '25 18:09 githubsgi

Yes. I think it'd be good to put all debug related configs together, including the random seed and the option added in https://github.com/pytorch/torchtitan/pull/1670. It will be a bigger refactor indeed.

@fegin, if you have a preference.

tianyu-l avatar Sep 29 '25 21:09 tianyu-l

@fegin, what are your thoughts on this?

githubsgi avatar Sep 30 '25 17:09 githubsgi

@tianyu-l, does the following debug(?) section look OK? Whichever PR gets merged last, this one or that one, can expand the debug section.


[debug]
torch_deterministic = false
torch_deterministic_warn_only = false
torch_preserve_rng_state = false
ac_determinism_check = "default"
ac_debug = false

githubsgi avatar Oct 03 '25 23:10 githubsgi