
Multi-GPU operation seems to be problematic

Open theslugger opened this issue 2 years ago • 3 comments

I can confirm that port 29500 is not being used by any other process...

Traceback (most recent call last):
  File "/data1/xxx/chat/LMFlow/service/app.py", line 35, in <module>
    model = AutoModel.get_model(model_args, tune_strategy='none', ds_config=ds_config, init_method="tcp://localhost:29501")
  File "/data1/xxx/chat/LMFlow/src/lmflow/models/auto_model.py", line 16, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/data1/xxx/chat/LMFlow/src/lmflow/models/hf_decoder_model.py", line 232, in __init__
    deepspeed.init_distributed()
  File "/home/xxx/miniconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 656, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/xxx/miniconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 36, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/xxx/miniconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 40, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/xxx/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 888, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/xxx/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/xxxi/miniconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
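For reference, one quick way to double-check whether a port is actually free on the machine is to try binding to it directly. A minimal standalone sketch (not LMFlow code):

import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    # Try to bind; errno 98 (EADDRINUSE) means another process already holds the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print("29500 free:", port_is_free(29500))
print("29501 free:", port_is_free(29501))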

theslugger avatar Apr 11 '23 03:04 theslugger

Thanks for your interest in LMFlow! I guess the problem is caused by a misspecification of the master port and master address. Could you please try setting

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500

and launch app.py again to see if the problem still exists? Thanks 😄
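If the port keeps getting taken by another process, a variant of the same idea is to pick a free port programmatically before DeepSpeed initializes. A minimal sketch, assuming a single-node, single-process launch (not LMFlow's actual startup code):

import os
import socket

def find_free_port() -> int:
    # Binding to port 0 lets the OS pick an unused port for us.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = str(find_free_port())
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")

import deepspeed
deepspeed.init_distributed()  # reads the MASTER_ADDR / MASTER_PORT set above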

research4pan avatar Apr 11 '23 10:04 research4pan

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 --master_port 29500 app.py

lyris avatar Apr 20 '23 04:04 lyris

Currently, multi-GPU inference may encounter issues. I'd suggest using a single GPU. Thanks!
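A generic way to force a single-GPU run (a sketch, not LMFlow-specific) is to hide the other devices before torch is imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # only GPU 0 is visible to this process

import torch
print(torch.cuda.device_count())  # should print 1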

shizhediao avatar Apr 20 '23 06:04 shizhediao

@shizhediao I know your reply above sort of already says 'no', but just in case something has changed: do you think it is practically possible to run RAFT on a fine-tuned Falcon-7B model with a GPT-NEO-7B or DistilBERT reward model (which I have already fine-tuned) using 4 GPUs with 22 GB each?

I've tried everything, but it all stops with an error (right after the _clean_text and _discard_sample stage).

gopstrit avatar Aug 06 '23 19:08 gopstrit

Btw, in my latest run, the code just stops, without any error. Such a silent heartbreak!

(screenshot)

gopstrit avatar Aug 06 '23 20:08 gopstrit

Is it OK when using only a single GPU? From the screen capture, I cannot see any errors. It may be related to DeepSpeed. Could you try other configs, for example removing the ZeRO strategy, or using ZeRO-2 instead of ZeRO-3?
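For reference, the piece that changes between these configs is the zero_optimization block. An illustrative ZeRO-2 config (placeholder values, not the exact ds_config files shipped with LMFlow; DeepSpeed also accepts such a Python dict in place of a JSON path):

ds_config_zero2 = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # use 0 to disable ZeRO, or 3 for full parameter partitioning
        "offload_optimizer": {"device": "cpu"},
    },
}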

shizhediao avatar Aug 07 '23 15:08 shizhediao

Hi @shizhediao Thanks for your reply.

I began with a single GPU (~24 GB); it stopped here: (screenshot)

I then switched to an AWS g5.12xlarge, which has 4 GPUs with ~24 GB each.

It then passed beyond the above point.

The thing is, it loads the fine-tuned model only onto GPU 0:

(screenshot)

I tried loading the model manually onto the other GPUs, and manually allocating/feeding data to the reward model and the fine-tuned model using .to('cuda:1'/'cuda:2'/'cuda:3'), roughly along these lines.
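A simplified sketch of that manual placement (illustrative only; the real objects are LMFlow's fine-tuned and reward models, replaced here by a dummy module, and it needs at least two visible GPUs):

import torch
import torch.nn as nn

device = torch.device("cuda:1")
model = nn.Linear(16, 1).to(device)    # stand-in for the reward model
batch = torch.randn(8, 16).to(device)  # stand-in for a tokenized batch

with torch.no_grad():
    scores = model(batch)
print(scores.device)  # cuda:1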

But at the following specific piece of code (in raft_trainer.py), the model automatically gets moved back to GPU 0: (screenshot)

Yesterday, I tried with both DeepSpeed ZeRO-2 and ZeRO-3: neither shows any error, but the process simply exits (screenshot above).

Btw, I also tried using torch.distributed:

(screenshot)

This is the error I get:

(screenshot)

gopstrit avatar Aug 07 '23 16:08 gopstrit

Could you check the RAM usage? From the first picture, the kill might be caused by running out of RAM.
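Something like this (assuming psutil is installed; not LMFlow code) could be printed periodically during the run to watch host RAM:

import psutil

vm = psutil.virtual_memory()
print(f"RAM used: {vm.used / 1e9:.1f} GB of {vm.total / 1e9:.1f} GB ({vm.percent}%)")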

shizhediao avatar Aug 07 '23 17:08 shizhediao

Hi @shizhediao , thanks for your reply.

Btw, I'm testing this with raft_batch_size = 8.

For the following, I first used ds_config_zero2.json.

The process stops with a CUDA out of memory error: (screenshot)

This is the exact point where the error occurs (in raft_aligner.py) (sorry, I had to add some console print statements for myself :D): (screenshot)

The following is the resource usage just before the error occurs:

(screenshot)

I also tried to run the code with ds_config_zero3.json

It stops with the following:

(screenshot)

Memory check before the error:

(screenshot)
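A small snippet like the following (a generic sketch, not LMFlow code) could be dropped in right before the failing step to see which GPU is filling up:

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1e9
    reserved = torch.cuda.memory_reserved(i) / 1e9
    print(f"GPU {i}: allocated {alloc:.2f} GB, reserved {reserved:.2f} GB")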

gopstrit avatar Aug 07 '23 18:08 gopstrit

I did a quick re-run (with ZeRO-3) to capture the instantaneous CPU/RAM usage when the error occurs.

Here's the resource usage closer to when we get the error (the error hadn't appeared yet at this point; the memory seems to be exhausted here):

(screenshot)

Then the memory frees up again, before the error message comes in:

(screenshot)

So, two questions here:

  • @shizhediao, do you think this is the issue here?
  • Do you see any possibility that I can still use 4 GPUs (24 GB each)? I'm especially concerned about the model loading onto GPU 0 and frying it up. Any idea how we can use all 4 GPUs?

Thank you so much!


gopstrit avatar Aug 07 '23 18:08 gopstrit

@shizhediao Today, I tried running the program on an instance with more RAM (but with the same number and size of GPUs).

I got pretty similar results.

I also tried running the program using accelerate launch (for which I stripped out the deepspeed line in run_raft.sh and created another test_runner.py file that executes run_raft.sh and, subsequently, raft_align.py).

I configured accelerate with both ds_config_zero2 and ds_config_zero3, as well as without deepspeed.

I also combined these with loading the fine-tuned model with/without 8-bit, with/without torch_dtype=torch.bfloat16, and with/without device_map = "auto".

This is the result with ds_config_zero2, without any additional model load settings (i.e. no 8-bit, no torch_dtype, etc.), run with accelerate launch test_runner.py:

(screenshot)

This is the result with ds_config_zero3: (screenshot)

This is the result with ds_config_zero2 plus additional model load settings (i.e. load_in_8bit=True): (screenshot)

All of the above stop with a CUDA OOM error.

I also tried accelerate launch with ds_config_zero3 (plus loading the model in 8-bit and device_map="auto"). This time around, it stops with the following error: (screenshot)
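For reference, the "additional model load settings" above mean roughly this (a sketch using the Hugging Face transformers API; the checkpoint path is a placeholder, and 8-bit loading additionally requires bitsandbytes):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "my-finetuned-falcon-7b",   # placeholder for the actual checkpoint path
    load_in_8bit=True,          # dropped for the full-precision runs
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shards the weights across all visible GPUs
)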

gopstrit avatar Aug 08 '23 11:08 gopstrit

It seems that CUDA OOM is the problem. Is it possible to use a GPU with more memory? Maybe @WeiXiongUST could provide some practical suggestions.

shizhediao avatar Aug 09 '23 20:08 shizhediao

You may check issue 545 and try out the separate implementation, which only loads one model at a time to reduce the memory requirements. We will update the implementation in LMFlow soon.
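The rough shape of that approach (a simplified sketch with generic Auto classes and placeholder paths, not the actual code from issue 545) is to keep only one model in GPU memory at a time:

import gc
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Stage 1: generate candidate responses with the fine-tuned model and save them to disk.
policy = AutoModelForCausalLM.from_pretrained("my-finetuned-model").cuda()  # placeholder path
# ... run generation over the prompts and write the outputs out ...
del policy
gc.collect()
torch.cuda.empty_cache()

# Stage 2: load the reward model alone and score the saved responses.
reward = AutoModelForSequenceClassification.from_pretrained("my-reward-model").cuda()  # placeholder path
# ... score the saved texts and keep the top-ranked samples for the next fine-tuning round ...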

WeiXiongUST avatar Aug 12 '23 02:08 WeiXiongUST

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen it. Thanks.

shizhediao avatar Sep 30 '23 19:09 shizhediao