
torchrun runs normally, but colossalai run reports an error

Open yangzhipeng1108 opened this issue 5 months ago • 2 comments

🐛 Describe the bug

torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=localhost --master_port=30013 train.py \
    --pretrained /root/ColossalAI/colossalai/Colossal-LLaMA-2-7b-base \
    --dataset /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00000 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00001 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00002 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00003 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00004 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00005 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00006 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00007 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00008 \
        /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00009 \
    --plugin zero2_cpu \
    --save_interval 400 \
    --save_dir /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/gpt4_data_zh-2024-03-20-11-18-29 \
    --tensorboard_dir /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/tensorboard/gpt4_data_zh-2024-03-20-11-18-29 \
    --config_file /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/config/gpt4_data_zh-2024-03-20-11-18-29.json \
    --num_epochs 1 \
    --accumulation_steps 8 \
    --micro_batch_size 1 \
    --lr 5e-5 \
    --mixed_precision bf16 \
    --grad_clip 1.0 \
    --weight_decay 0.01 \
    --warmup_steps 100 \
    --use_grad_checkpoint \
    --use_flash_attn \
    --use_neft \
    --pad_token eos

The torchrun command above runs without errors.

Running bash train_sft.sh fails with this error:

/bin/bash: -c: line 0: unexpected EOF while looking for matching `"'
/bin/bash: -c: line 1: syntax error: unexpected end of file

#!/bin/bash

# NCCL IB environment variables
export NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TIMEOUT=23
export NCCL_IB_RETRY_CNT=7
export OMP_NUM_THREADS=8

PROJECT_NAME="gpt4"
PARENT_SAVE_DIR="/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/"
PARENT_TENSORBOARD_DIR="/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/tensorboard/"
PARENT_CONFIG_FILE="/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/config/"
PRETRAINED_MODEL_PATH="/root/ColossalAI/colossalai/Colossal-LLaMA-2-7b-base/"

declare -a dataset=(
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00000"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00001"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00002"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00003"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00004"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00005"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00006"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00007"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00008"
    "/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00009"
)

TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}"
SAVE_DIR="${PARENT_SAVE_DIR}${FULL_PROJECT_NAME}"
TENSORBOARD_DIR="${PARENT_TENSORBOARD_DIR}${FULL_PROJECT_NAME}"
CONFIG_FILE="${PARENT_CONFIG_FILE}${FULL_PROJECT_NAME}.json"

colossalai run --nproc_per_node 8 --master_port 30013 train.py \
    --pretrained $PRETRAINED_MODEL_PATH \
    --dataset ${dataset[@]} \
    --plugin "zero2_cpu" \
    --save_interval 400 \
    --save_dir $SAVE_DIR \
    --tensorboard_dir $TENSORBOARD_DIR \
    --config_file $CONFIG_FILE \
    --num_epochs 1 \
    --accumulation_steps 8 \
    --micro_batch_size 1 \
    --lr 5e-5 \
    --mixed_precision "bf16" \
    --grad_clip 1.0 \
    --weight_decay 0.01 \
    --warmup_steps 100 \
    --use_grad_checkpoint \
    --use_flash_attn \
    --use_neft \
    --pad_token "eos"

Environment

No response

yangzhipeng1108 avatar Mar 21 '24 02:03 yangzhipeng1108

Hi, this looks like a syntax problem in your bash script:

/bin/bash: -c: line 0: unexpected EOF while looking for matching `"'
/bin/bash: -c: line 1: syntax error: unexpected end of file

I suggest checking whether an opening quote was left unclosed, and also whether any quotes were typed with a Chinese input method (full-width or curly quotes rather than the ASCII " character).
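
Both checks can be sketched in shell. This is only a rough sketch: the function name find_suspicious_quotes is made up for illustration, the list of quote characters is an assumption about what a Chinese input method typically produces, and train_sft.sh is the script name from the report. Note that bash -n only parses the script file itself, so it would not catch a quoting problem in a command string that a launcher assembles afterwards.

```shell
#!/bin/bash
# Hypothetical helper: print (with line numbers) every line of a script that
# contains curly or full-width quotes. Bash does not treat these characters
# as quoting characters, so they often cause confusing syntax errors.
find_suspicious_quotes() {
    grep -n -e '“' -e '”' -e '‘' -e '’' -e '＂' -e '＇' "$1"
}

# Example usage (script name taken from the report):
#   find_suspicious_quotes train_sft.sh   # lists offending lines, if any
#   bash -n train_sft.sh                  # parse only, do not execute;
#                                         # a non-zero exit means a syntax error
```

If both checks come back clean, the script file is probably fine and the unbalanced quote is more likely introduced somewhere after the script hands the command over.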

char-1ee avatar Mar 22 '24 06:03 char-1ee

colossalai run --nproc_per_node 8 train.py \
    --pretrained /root/ColossalAI/colossalai/Colossal-LLaMA-2-7b-base \
    --dataset /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/spliced_tokenized_output_arrow/part-00000 \
    --save_dir /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/gpt4-2024-03-20-11-18-29 \
    --tensorboard_dir /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/tensorboard/gpt4-2024-03-20-11-18-29 \
    --config_file /root/ColossalAI/ColossalAI/applications/Colossal-LLaMA-2/output/config/gpt4-2024-03-20-11-18-29.json \
    --num_epochs 1 \
    --accumulation_steps 8 \
    --micro_batch_size 1 \
    --lr 5e-5 \
    --grad_clip 1.0 \
    --weight_decay 0.01 \
    --warmup_steps 100 \
    --use_grad_checkpoint \
    --use_flash_attn \
    --use_neft

This command, which contains no quotes at all, reports the same error.

yangzhipeng1108 avatar Mar 28 '24 02:03 yangzhipeng1108