OpenChatKit
RuntimeError: Socket Timeout
sh training/finetune_Pythia-Chat-Base-7B.sh
Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp
Traceback (most recent call last):
File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in
The error is reported when running with a single GPU.
Getting the same error here.
Some of the other parameters need to be adjusted for a single GPU:
--num-layers 4 --embedding-dim 4096
--world-size 1
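A minimal sketch of what the single-GPU invocation might then look like, assuming the corresponding flags are removed from ${ARGS} in finetune_Pythia-Chat-Base-7B.sh (or that the trailing flags take precedence); the values are the ones quoted above and are illustrative, not tested:
# Single GPU: with --world-size 1, both group sizes must also be 1, because
# training/comm/comm_utils.py asserts world_size == data_group_size * pipeline_group_size.
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) \
    --num-layers 4 --embedding-dim 4096 \
    --world-size 1 --pipeline-group-size 1 --data-group-size 1 \
    --cuda-id 0 --rank 0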
Gets me:
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0 comm init done!!
But I forgot to download the pretrained model (as per the training instructions), so it stopped there. I'll post results once that step is complete.
Cheers, Darrin
Hi Darrin,
I'm also getting the same error here with two GPUs.
I only modified the fine-tuning script:
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0
&
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1
In fine-tuning I also need to set:
--num-layers 4 --embedding-dim 4096
--world-size 2 --pipeline-group-size 4 --data-group-size 2
right?
I have tried with a single GPU and modified the related parameters, but hit the issue below:
File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 86, in init_communicators
assert args.world_size == args.data_group_size * args.pipeline_group_size
AssertionError
Thanks, Yuanyuan
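A hedged sketch of a two-GPU launch that keeps the two rank commands above but makes the group sizes consistent with that assertion; the pipeline/data split shown is only one illustrative choice, and it assumes the same flags are not also set to different values inside ${ARGS}:
# world size 2 == pipeline-group-size 2 * data-group-size 1
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --world-size 2 --pipeline-group-size 2 --data-group-size 1 --cuda-id 0 --rank 0 &
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --world-size 2 --pipeline-group-size 2 --data-group-size 1 --cuda-id 1 --rank 1 &
wait  # block until both ranks exit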
It won't train on my 12 GB GPU; it runs out of memory and needs more VRAM than I currently have.
@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead.
Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py
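If you want to try it, here is a hedged sketch of running that example from the repo root; the script path comes from the link above, and any model paths, flags, or dependencies it expects are not documented in this thread, so check the file before running:
# Illustrative only: run the LoRA fine-tuning example referenced above.
cd OpenChatKit
python training/lora/redpajama-incite-chat-3b.py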
Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, I'll check it out.
@yxy123 The arguments provided are invalid. args.world_size == args.data_group_size * args.pipeline_group_size must be true.
Change this line:
> --world-size 2 --pipeline-group-size 4 --data-group-size 2
so that world_size = pipeline_group_size * data_group_size.
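For example, either of these illustrative combinations satisfies the assertion for two GPUs, whereas the original line fails because 2 ≠ 4 × 2:
--world-size 2 --pipeline-group-size 2 --data-group-size 1   # 2 == 2 * 1
--world-size 2 --pipeline-group-size 1 --data-group-size 2   # 2 == 1 * 2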
@orangetin Got it, thanks very much, it worked.