[feature request] unable to override `dist.init_process_group` timeout under `zero.Init`
When `zero.Init` is used, how can a user override the `timeout` arg of the dist init? The dist group is initialized here:
https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/runtime/zero/partition_parameters.py#L654-L655
Of course the user could do something like this before instantiating the model:
```python
from datetime import timedelta

import torch

[...]

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(backend="nccl", timeout=timedelta(seconds=3*60*60))
```
but I'm not sure this would even work now that deepspeed uses its own comms module, and the init call is much more complex - how would users know to provide the right args?
https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/comm/comm.py#L376-L383
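For what it's worth, the linked `init_distributed` signature appears to accept a `timeout` argument, so pre-initializing deepspeed's comms yourself before building the model might be another workaround. A minimal sketch, assuming that `timeout` is forwarded to the underlying process group init:

```python
from datetime import timedelta

import deepspeed

# Assumption: deepspeed.init_distributed forwards `timeout` to the underlying
# torch.distributed.init_process_group call (per the linked signature).
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=3))

# zero.Init (e.g. via transformers' from_pretrained) should then see that comms
# are already initialized and skip its own init call.
model = ...  # instantiate the model here
```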
but perhaps another alternative is to give users an API to override `default_pg_timeout`, so that they could just do:
```python
from deepspeed import something

something.set_default_pg_timeout(3*60*60)
model = AutoModel.from_pretrained(...)  # which internally calls zero.Init in transformers
```
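One possible shape for such a setter, purely illustrative - `set_default_pg_timeout` and the module holding it are hypothetical, not an existing DeepSpeed API:

```python
# Hypothetical sketch only - not an existing DeepSpeed API.
from datetime import timedelta

_default_pg_timeout = timedelta(minutes=30)  # torch's usual 30-minute default

def set_default_pg_timeout(seconds: int) -> None:
    """Let users raise/lower the process-group timeout before zero.Init runs."""
    global _default_pg_timeout
    _default_pg_timeout = timedelta(seconds=seconds)

def get_default_pg_timeout() -> timedelta:
    """zero.Init / init_distributed would read the timeout through this getter."""
    return _default_pg_timeout
```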
For context: we are dealing with a silent GPU crash followed by a timeout, so in order to catch this event in action we were trying to extend the timeout to something much longer and ran into this issue. Going to try the workaround I proposed at the top of this post.
@tjruwase
Do you think a ds_config option would work here? zero.Init has access to the ds_config.
Yes, that would be another way to do it. I just thought that, as this is very rarely going to be used, it wouldn't be worthwhile adding to the config, but by all means, yes, it'd work.
I've been facing NCCL timeouts, and even after setting the timeout to a larger value in AcceleratorState, I still face this error. I finally see that this is on the DeepSpeed side.
It's a bit complex to figure out the right place to initialize the process group, so for now I've done an ugly thing and modified constants.py directly to increase the timeout.
Would be great to have this feature!
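A possibly less invasive route than editing the file: newer DeepSpeed versions appear to build `default_pg_timeout` in constants.py from a `DEEPSPEED_TIMEOUT` environment variable (in minutes) - please verify this against the constants.py of your installed version before relying on it. A sketch under that assumption:

```python
import os

# Assumption: the installed constants.py reads DEEPSPEED_TIMEOUT (in minutes)
# when computing default_pg_timeout; check your version before relying on this.
os.environ["DEEPSPEED_TIMEOUT"] = "180"  # 3 hours

# Must come after the env var is set, since the default is evaluated at import time.
import deepspeed
```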
same error here, any update?
same error, how is it going now?
How to solve it? Same error.
how to solve it?
If you're using DeepSpeed via Accelerate, this has worked fine there since the summer.
Try the approach shown in this reply: https://github.com/huggingface/accelerate/issues/1401#issuecomment-1543257739
I have just tried the code from above and it works for me, e.g. I set the timeout to 1 sec, and then the timeout easily gets triggered - e.g. during dataloader creation.
Also add `export TORCH_CPP_LOG_LEVEL=INFO` and then check the logged value of `TIMEOUT(ms)`, as demoed in the comment I linked to above.
Make sure you get the numbers right when you assign the timeout, e.g. for a 3h timeout: `timeout=timedelta(seconds=3*60*60)`.
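A minimal sketch of that approach, assuming the standard Accelerate kwargs-handler API from the linked reply:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Ask Accelerate to pass a 3h timeout when it initializes the process group.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3*60*60))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```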
I'm using accelerate==0.25, torch==2.1.2, deepspeed==0.12.6, zero3_init_flag: true