[feature request] unable to override `dist.init_process_group` timeout under `zero.Init`
When `zero.Init` is used, how can a user override the `timeout` arg of the dist init? The dist group is initialized here:
https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/runtime/zero/partition_parameters.py#L654-L655
Of course the user could do something like this before instantiating the model:
```python
from datetime import timedelta

import torch

[...]

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(backend="nccl", timeout=timedelta(seconds=3*60*60))
```
but I'm not sure this would even work now that deepspeed uses its own comms module, and the init call is much more complex - how would users know to provide the right args?
https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/comm/comm.py#L376-L383
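For what it's worth, the linked `init_distributed` signature appears to accept a `timeout` argument, so pre-initializing deepspeed's comms yourself before building the model might be another workaround. A minimal sketch, assuming that `timeout` is forwarded to the underlying process group init:

```python
from datetime import timedelta

import deepspeed

# Assumption: deepspeed.init_distributed forwards `timeout` to the underlying
# torch.distributed.init_process_group call (per the linked signature).
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=3))

# zero.Init (e.g. via transformers' from_pretrained) should then see that comms
# are already initialized and skip its own init call.
model = ...  # instantiate the model here
```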
but perhaps another alternative is to give users an API to override `default_pg_timeout`, so that they could just do:
```python
from deepspeed import something

something.set_default_pg_timeout(3*60*60)
model = AutoModel.from_pretrained(...)  # which internally calls zero.Init in transformers
```
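One possible shape for such a setter, purely illustrative - `set_default_pg_timeout` and the module holding it are hypothetical, not an existing DeepSpeed API:

```python
# Hypothetical sketch only - not an existing DeepSpeed API.
from datetime import timedelta

_default_pg_timeout = timedelta(minutes=30)  # torch's usual 30-minute default

def set_default_pg_timeout(seconds: int) -> None:
    """Let users raise/lower the process-group timeout before zero.Init runs."""
    global _default_pg_timeout
    _default_pg_timeout = timedelta(seconds=seconds)

def get_default_pg_timeout() -> timedelta:
    """zero.Init / init_distributed would read the timeout through this getter."""
    return _default_pg_timeout
```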
For context: we are dealing with a silent GPU crash followed by a timeout, so in order to catch this event in action we were trying to extend the timeout to something much longer and ran into this issue. Going to try the workaround I proposed at the top of this post.
@tjruwase
Do you think a ds_config option would work here? zero.Init has access to the ds_config.
Yes, that would be another way to do it. I just thought that, as this is very rarely going to be used, it wouldn't be worthwhile adding to the config, but by all means, yes, it'd work.
I've been facing NCCL timeouts, and even after setting the timeout to a larger value in AcceleratorState, I still face this error. I finally see that this is on the DeepSpeed side.
It's a bit complex to figure out the right place to initialize the process group, so for now I've done an ugly thing and modified constants.py directly to increase the timeout.
Would be great to have this feature!
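A possibly less invasive route than editing the file: newer DeepSpeed versions appear to build `default_pg_timeout` in constants.py from a `DEEPSPEED_TIMEOUT` environment variable (in minutes) - please verify this against the constants.py of your installed version before relying on it. A sketch under that assumption:

```python
import os

# Assumption: the installed constants.py reads DEEPSPEED_TIMEOUT (in minutes)
# when computing default_pg_timeout; check your version before relying on this.
os.environ["DEEPSPEED_TIMEOUT"] = "180"  # 3 hours

# Must come after the env var is set, since the default is evaluated at import time.
import deepspeed
```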
same error here, any update?
same error, how is it going now?
How to solve it? Same error.
how to solve it?
If you're using DeepSpeed via Accelerate, this has worked fine there since the summer.
Try the approach shown in this reply: https://github.com/huggingface/accelerate/issues/1401#issuecomment-1543257739
I have just tried the code from above and it works for me, e.g. I set the timeout to 1 sec, and then the timeout easily gets triggered - e.g. during dataloader creation.
Also add `export TORCH_CPP_LOG_LEVEL=INFO` and then check the logged value of `TIMEOUT(ms)`, as demoed in the comment I linked to above.
Make sure you get the numbers right when you assign the timeout, e.g. for a 3h timeout: `timeout=timedelta(seconds=3*60*60)`.
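A minimal sketch of that approach, assuming the standard Accelerate kwargs-handler API from the linked reply:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Ask Accelerate to pass a 3h timeout when it initializes the process group.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3*60*60))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```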
I'm using accelerate==0.25, torch==2.1.2, deepspeed==0.12.6, zero3_init_flag: true