RuntimeError: Socket Timeout

Open angeliababy opened this issue 1 year ago • 8 comments

sh training/finetune_Pythia-Chat-Base-7B.sh

Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp
Traceback (most recent call last):
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout

This error is reported when running with a single GPU.

angeliababy avatar Apr 21 '23 10:04 angeliababy

Getting the same error here.

darrinh avatar May 19 '23 01:05 darrinh

Some of the other parameters need to be adjusted for a single GPU:

< --num-layers 4 --embedding-dim 4096
< --world-size 1

Gets me:

Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0 comm init done!!

But I forgot to download the pretrained model (as per the training instructions), so it stopped there. I will post results once that step is complete.

Cheers, Darrin

darrinh avatar May 19 '23 02:05 darrinh
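
A likely explanation for why lowering `--world-size` gets past the timeout: `dist.init_process_group` waits for `world_size` processes to join at `dist_url`, and the traceback shows a 5-minute timeout. If fewer `dist_clm_train.py` processes are launched than `--world-size` asks for, the missing ranks never arrive and the call ends in `Socket Timeout`. The snippet below is a plain-Python pre-flight sketch of that idea, not code from OpenChatKit; `missing_ranks` is a hypothetical helper name.

```python
def missing_ranks(world_size, launched_ranks):
    """Return the ranks the rendezvous would still be waiting for."""
    return sorted(set(range(world_size)) - set(launched_ranks))


# Only rank 0 is launched, but the script still asks for two ranks:
print(missing_ranks(world_size=2, launched_ranks=[0]))  # [1]

# Single-GPU run with --world-size 1: nothing left to wait for.
print(missing_ranks(world_size=1, launched_ranks=[0]))  # []
```

An empty list means every expected rank has a process behind it; anything else points at a `--rank` that was never started.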

Hi Darrin, I'm also getting the same error here with two GPUs. I only modified the fine-tuning script:

python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 &
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1

In the fine-tuning script, do I also need to set:

--num-layers 4 --embedding-dim 4096
--world-size 2 --pipeline-group-size 4 --data-group-size 2

Is that right? I have also tried with a single GPU and modified the related parameters, but hit the issue below:

File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 86, in init_communicators
    assert args.world_size == args.data_group_size * args.pipeline_group_size
AssertionError

Thanks Yuanyuan

yxy123 avatar May 31 '23 02:05 yxy123

It won't train on my 12GB GPU, it runs out of memory. It requires more VRAM than I currently have.

darrinh avatar May 31 '23 05:05 darrinh
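
A back-of-the-envelope estimate shows why 12 GB falls short. This is a rough sketch that assumes the whole Pythia-6.9B model sits on one GPU in fp16 (the log above shows `fp16=True`) and counts only the weights; gradients, optimizer state, and activations come on top. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 6.9e9  # Pythia-6.9B, rounded parameter count (assumption)
fp16_weights = weight_memory_gb(params, 2)  # fp16 uses 2 bytes per parameter

print(f"fp16 weights alone: {fp16_weights:.1f} GB")  # 13.8 GB
print("fits in 12 GB:", fp16_weights <= 12)          # False
```

Since the weights by themselves already exceed the card's capacity, full fine-tuning cannot fit without splitting the model across GPUs or training far fewer parameters.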

@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead.

Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py

orangetin avatar May 31 '23 05:05 orangetin

Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, I will check it out.

thanks

darrinh avatar May 31 '23 06:05 darrinh

@yxy123 The arguments provided are invalid. args.world_size == args.data_group_size * args.pipeline_group_size must be true.

Change this line:

--world-size 2 --pipeline-group-size 4 --data-group-size 2

so that world-size = pipeline-group-size * data-group-size.

orangetin avatar May 31 '23 06:05 orangetin
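
The constraint can be checked before launching anything. The sketch below mirrors the assertion quoted from `comm_utils.py`; `check_group_sizes` is a hypothetical helper, not part of OpenChatKit, and the valid combinations shown are only examples (the thread does not say which values were finally used).

```python
def check_group_sizes(world_size, pipeline_group_size, data_group_size):
    """Report whether world_size == pipeline_group_size * data_group_size."""
    expected = pipeline_group_size * data_group_size
    if world_size != expected:
        return f"invalid: world_size={world_size}, expected {expected}"
    return "ok"


# The arguments from the failing run: 4 * 2 = 8, not 2.
print(check_group_sizes(2, 4, 2))  # invalid: world_size=2, expected 8

# Example combinations that satisfy the check for two GPUs and one GPU.
print(check_group_sizes(2, 2, 1))  # ok
print(check_group_sizes(1, 1, 1))  # ok
```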

@orangetin Got it, thanks very much, it worked.

yxy123 avatar May 31 '23 07:05 yxy123