MiniLLM LLaMA-Torch Shape Mismatch for Model Parallel
I am checking to see if I am doing this correctly.
I downloaded LLaMA2-13B from HuggingFace (https://huggingface.co/meta-llama/Llama-2-13b-hf) and created a weight configuration file named "llama2.json", which is identical to "mp_weight_configs/llama.json". Next, I converted the model for model parallelism as follows:
python3 tools/convert_mp.py --input_path checkpoint-for-llama2-13b --source_mp_size 1 --target_mp_size 2 --model_type llama2
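For a quick sanity check of the conversion output, one can print the embedding shape stored in a converted shard. The shard path below is only a guess at the layout tools/convert_mp.py writes; adjust it to the files that were actually produced:
python3 -c "import torch; sd = torch.load('checkpoint-for-llama2-13b/mp2/pytorch_model_0.bin', map_location='cpu'); print({k: tuple(v.shape) for k, v in sd.items() if 'embed_tokens' in k})"
For a target MP size of 2, the embedding should show 16000 rows per shard instead of the full 32000.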
I then created "scripts/llama2/sft/sft_13B_mp2.sh" following "scripts/llama2/sft/sft_7B_mp4.sh" and ran it with MP_SIZE=2, but I am getting the following error:
"ValueError: Trying to set a tensor of shape torch.Size([16000, 5120]) in "weight" (which has shape torch.Size([32000, 5120])), this looks incorrect."
I would really appreciate it if you could tell me if I am missing anything.
Hi, the checkpoint itself seems to be parallelized successfully, since the original vocab size of 32000 has become 16000 in the saved shard. However, the model that loads the checkpoint is not built with model parallelism (its vocab size is still 32000). Could you please share your scripts/llama2/sft/sft_13B_mp2.sh so that I can check which part is missing?
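For reference, a rough sketch of how the embedding is expected to be sharded for LLaMA2-13B (vocab size 32000, hidden size 5120); the numbers follow directly from splitting the vocab dimension across model-parallel ranks:
# per-rank shape of the embed_tokens weight
#   mp_size 1 -> [32000, 5120]
#   mp_size 2 -> [16000, 5120]   (what the converted shard contains)
#   mp_size 4 -> [ 8000, 5120]
# The error message shows the in-memory module still expects [32000, 5120],
# i.e. the model being constructed at load time is not model-parallel.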
This is the script that I used:
#! /bin/bash
MASTER_ADDR=localhost
MASTER_PORT=${2-2012}
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=${3-16}
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT"
# model
BASE_PATH=${1-"/home/MiniLLM"}
CKPT_NAME="llama2-13B"
CKPT="${BASE_PATH}/checkpoints/${CKPT_NAME}/"
MP_SIZE=2
# data
DATA_DIR="${BASE_PATH}/processed_data/dolly/full/llama2/"
# hp
BATCH_SIZE=4
LR=0.00001
GRAD_ACC=2
EVAL_BATCH_SIZE=8
# length
MAX_LENGTH=512
# runtime
SAVE_PATH="${BASE_PATH}/results/llama2/train/sft"
# seed
SEED=10
SEED_ORDER=10

OPTS=""
# model
OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --model-path ${CKPT}"
OPTS+=" --ckpt-name ${CKPT_NAME}"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"
OPTS+=" --model-type llama2"
OPTS+=" --gradient-checkpointing"
OPTS+=" --model-parallel"
OPTS+=" --model-parallel-size ${MP_SIZE}"
# data
OPTS+=" --data-dir ${DATA_DIR}"
OPTS+=" --num-workers 0"
OPTS+=" --dev-num 500"
# hp
OPTS+=" --lr ${LR}"
OPTS+=" --batch-size ${BATCH_SIZE}"
OPTS+=" --eval-batch-size ${EVAL_BATCH_SIZE}"
OPTS+=" --gradient-accumulation-steps ${GRAD_ACC}"
OPTS+=" --warmup-iters 0"
OPTS+=" --lr-decay-style cosine"
OPTS+=" --weight-decay 1e-2"
OPTS+=" --clip-grad 1.0"
OPTS+=" --epochs 10"
# length
OPTS+=" --max-length ${MAX_LENGTH}"
OPTS+=" --max-prompt-length 256"
# runtime
OPTS+=" --do-train"
OPTS+=" --do-valid"
OPTS+=" --eval-gen"
OPTS+=" --save-interval -1"
OPTS+=" --eval-interval -1"
OPTS+=" --log-interval 4"
OPTS+=" --mid-log-num 1"
OPTS+=" --save ${SAVE_PATH}"
# seed
OPTS+=" --seed ${SEED}"
OPTS+=" --seed-order ${SEED_ORDER}"
# deepspeed
OPTS+=" --deepspeed"
OPTS+=" --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2_fp16.json"
# type
OPTS+=" --type lm"
# gen
OPTS+=" --do-sample"
OPTS+=" --top-k 0"
OPTS+=" --top-p 1.0"
OPTS+=" --temperature 1.0"

export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=3
export PYTHONPATH=${BASE_PATH}
CMD="torchrun ${DISTRIBUTED_ARGS} ${BASE_PATH}/finetune.py ${OPTS} $@"

echo ${CMD}
echo "PYTHONPATH=${PYTHONPATH}"
mkdir -p ${SAVE_PATH}
${CMD}
I can successfully run this script locally. Did you install our modified transformers library, which implements model parallelism?
pip3 install git+https://github.com/t1101675/transformers@minillm
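A quick way to confirm that this fork is actually the transformers being imported in the training environment (the reported version string of the fork may differ from a stock release; what matters is the install path):
python3 -c "import transformers; print(transformers.__version__, transformers.__file__)"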
Yes. I tried running in a clean environment, but I am still getting memory errors like the following:
"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU"
This is because your GPU memory is not large enough. You can try a larger model parallel size (such as 4).
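Note that when you change the model parallel size, the checkpoint has to be re-split to match it, e.g. by rerunning the conversion command from your first message with a larger target size (same paths as in your setup), and then setting MP_SIZE accordingly in the script:
python3 tools/convert_mp.py --input_path checkpoint-for-llama2-13b --source_mp_size 1 --target_mp_size 4 --model_type llama2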
I tried using MP_SIZE = 4, but I'm still getting the "out of memory" error. Do you have any suggestions in this regard?
Interestingly, I successfully ran the script for "LLaMA2-7B" with MP_SIZE = 4.
It is expected that the 7B model uses less memory than the 13B model. To reduce the memory usage for the 13B model, you can:
- Decrease the BATCH_SIZE and EVAL_BATCH_SIZE hyperparameters.
- Use a larger model parallel size, such as MP_SIZE = 8 (a sketch of both adjustments follows).
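For example, a possible set of adjustments in sft_13B_mp2.sh (the concrete values are only a suggestion; GRAD_ACC is raised so that the effective batch size stays roughly the same, and the checkpoint has to be re-converted to match the new MP size):
# hp (smaller per-GPU batches to reduce activation memory)
BATCH_SIZE=1
EVAL_BATCH_SIZE=2
GRAD_ACC=8
# model (requires re-running tools/convert_mp.py with --target_mp_size 8)
MP_SIZE=8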