
[feature request] unable to override `dist.init_process_group` timeout under `zero.Init`

Open stas00 opened this issue 2 years ago • 4 comments

When `zero.Init` is used, how can a user override the `timeout` arg of the dist init? The dist is initialized here:

https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/runtime/zero/partition_parameters.py#L654-L655

Of course the user could do something like this before instantiating the model:

from datetime import timedelta
import torch
[...]
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group(backend="nccl", timeout=timedelta(seconds=3*60*60))

but I'm not sure this would even work now that DeepSpeed uses its own comms module, and its init call is much more complex - how would the user know to provide the right args?

https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/comm/comm.py#L376-L383
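Actually, looking at that signature, init_distributed already seems to accept a timeout kwarg (defaulting to default_pg_timeout), so maybe the user could call it themselves before building the model - an untested sketch, assuming that kwarg behaves as it reads at this commit:

# untested sketch: pre-initialize DeepSpeed's comms with a longer timeout,
# assuming deepspeed.init_distributed() accepts a `timeout` kwarg at this commit
from datetime import timedelta

import deepspeed
from transformers import AutoModel

deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=3))
model = AutoModel.from_pretrained(...)  # zero.Init should then see dist as already initialized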

but perhaps another alternative is to give users an API to override default_pg_timeout, so they could just do:

from deepspeed import something
something.set_default_pg_timeout(3*60*60)
model = AutoModel.from_pretrained(...) # which internally calls `zero.Init` in transformers

For context: we are dealing with a silent GPU crash followed by a timeout, so in order to catch this event in action we were trying to extend the timeout to something much longer and ran into this issue. Going to try the workaround I proposed at the top of this post.

@tjruwase

stas00 avatar Mar 02 '23 18:03 stas00

Do you think a ds_config option would work here? zero.Init has access to the ds_config.

tjruwase avatar Mar 04 '23 11:03 tjruwase

Yes, that would be another way to do it. I just thought that since this is going to be used very rarely it wouldn't be worth adding to the config, but by all means, yes, it'd work.

stas00 avatar Mar 04 '23 16:03 stas00

I've been facing NCCL timeouts, and even after setting the timeout to a larger value in AcceleratorState I still hit this error. I finally see that this is on the DeepSpeed side.

It's a bit complex to figure out the right place to initialize the process group, so for now I've done the ugly thing of modifying constants.py directly to increase the timeout.
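For reference, the edit amounts to something like this in deepspeed/constants.py (a sketch - the exact line may differ between versions):

# deepspeed/constants.py -- hand-edited stopgap, not a real fix
from datetime import timedelta

# the stock value is timedelta(minutes=30); bump it so long collectives
# aren't killed before the hang can be inspected
default_pg_timeout = timedelta(hours=3)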

Would be great to have this feature!

sandeepchittilla avatar Apr 14 '23 15:04 sandeepchittilla

Same error here, any update?

bestpredicts avatar May 09 '23 15:05 bestpredicts

Same error here, how is it going now?

qingqiuhe avatar Aug 08 '23 08:08 qingqiuhe

Same error here, how do I solve it?

mynewstart avatar Aug 25 '23 08:08 mynewstart

How do I solve it?

chen278947895 avatar Jan 03 '24 13:01 chen278947895

If you're using DeepSpeed via Accelerate, this has worked fine there since the summer.

Try the approach shown in this reply: https://github.com/huggingface/accelerate/issues/1401#issuecomment-1543257739

I have just tried the code from above and it works for me: e.g. if I set the timeout to 1 sec, the timeout easily gets triggered - for example inside dataloader creation.

Also add `export TORCH_CPP_LOG_LEVEL=INFO` and then check the logged value of TIMEOUT(ms), as demoed in the comment I linked to above.

Make sure you get the numbers right when you assign the timeout, e.g. for 3h: `timeout=timedelta(seconds=3*60*60)`.
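Roughly, the recipe from that linked comment looks like this (a sketch of Accelerate's kwargs-handler route, adapt the timeout to your setup):

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# the timeout is passed down to the process-group init;
# 1 sec here only to verify that the override actually takes effect
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=1))
accelerator = Accelerator(kwargs_handlers=[kwargs])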

I'm using accelerate==0.25, torch==2.1.2, deepspeed==0.12.6, zero3_init_flag: true

stas00 avatar Jan 03 '24 18:01 stas00