MiniLLM LLaMA-Torch Shape Mismatch for Model Parallel

Open majidurrahman1437 opened this issue 9 months ago • 8 comments

I am checking to see if I am doing this correctly.

I have downloaded LLaMA2-13B from HuggingFace (https://huggingface.co/meta-llama/Llama-2-13b-hf). I generated a weight configuration file named "llama2.json", which is identical to "mp_weight_configs/llama.json". Next, I converted the model for model parallelism as follows:

python3 tools/convert_mp.py --input_path checkpoint-for-llama2-13b --source_mp_size 1 --target_mp_size 2 --model_type llama2
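
For reference, here is a minimal sketch of how I could inspect the converted shards; the shard path and state-dict key below are my assumptions, not necessarily the exact output layout of tools/convert_mp.py:

import torch

# Hypothetical shard path; adjust to the actual output layout of tools/convert_mp.py.
shard_path = "checkpoint-for-llama2-13b-mp2/mp_rank_00/pytorch_model.bin"
state_dict = torch.load(shard_path, map_location="cpu")

# "model.embed_tokens.weight" is the standard HF LLaMA key; with
# --target_mp_size 2, the vocab dimension should be 32000 / 2 = 16000.
print(state_dict["model.embed_tokens.weight"].shape)  # expected: torch.Size([16000, 5120])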

I generated a script "sft_13B_mp2.sh" following the script at "scripts/llama2/sft/sft_7B_mp4.sh", and I ran "scripts/llama2/sft/sft_13B_mp2.sh" with MP_SIZE=2, but I am getting the following error:

"ValueError: Trying to set a tensor of shape torch.Size([16000, 5120]) in "weight" (which has shape torch.Size([32000, 5120])), this looks incorrect."

I would really appreciate it if you could tell me if I am missing anything.

majidurrahman1437 avatar Mar 16 '25 20:03 majidurrahman1437

Hi, it seems that the checkpoint was successfully parallelized, because the original 32000 vocab size becomes 16000. However, the model loading the checkpoint does not apply model parallelism (its vocab size is still 32000). Could you please share the scripts/llama2/sft/sft_13B_mp2.sh script so that I can check which part is missing?
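
As a rough illustration (not the exact code in the modified transformers fork), vocab-parallel model parallelism splits the embedding along the vocabulary dimension, which is where the 16000 comes from:

import torch

vocab_size, hidden_size, mp_size = 32000, 5120, 2

# Build the full embedding on the "meta" device so no real memory is allocated.
full_embedding = torch.empty(vocab_size, hidden_size, device="meta")

# Each model-parallel rank holds one contiguous slice of the vocabulary.
shards = torch.chunk(full_embedding, mp_size, dim=0)
print([tuple(s.shape) for s in shards])  # [(16000, 5120), (16000, 5120)]

# The ValueError appears when one (16000, 5120) shard is loaded into a model
# built without model parallelism, i.e. with the full (32000, 5120) embedding.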

t1101675 avatar Mar 16 '25 21:03 t1101675

This is the script that I used:

#! /bin/bash

MASTER_ADDR=localhost
MASTER_PORT=${2-2012}
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=${3-16}

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

# model
BASE_PATH=${1-"/home/MiniLLM"}
CKPT_NAME="llama2-13B"
CKPT="${BASE_PATH}/checkpoints/${CKPT_NAME}/"
MP_SIZE=2
# data
DATA_DIR="${BASE_PATH}/processed_data/dolly/full/llama2/"
# hp
BATCH_SIZE=4
LR=0.00001
GRAD_ACC=2
EVAL_BATCH_SIZE=8
# length
MAX_LENGTH=512
# runtime
SAVE_PATH="${BASE_PATH}/results/llama2/train/sft"
# seed
SEED=10
SEED_ORDER=10

OPTS=""
# model
OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --model-path ${CKPT}"
OPTS+=" --ckpt-name ${CKPT_NAME}"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"
OPTS+=" --model-type llama2"
OPTS+=" --gradient-checkpointing"
OPTS+=" --model-parallel"
OPTS+=" --model-parallel-size ${MP_SIZE}"
# data
OPTS+=" --data-dir ${DATA_DIR}"
OPTS+=" --num-workers 0"
OPTS+=" --dev-num 500"
# hp
OPTS+=" --lr ${LR}"
OPTS+=" --batch-size ${BATCH_SIZE}"
OPTS+=" --eval-batch-size ${EVAL_BATCH_SIZE}"
OPTS+=" --gradient-accumulation-steps ${GRAD_ACC}"
OPTS+=" --warmup-iters 0"
OPTS+=" --lr-decay-style cosine"
OPTS+=" --weight-decay 1e-2"
OPTS+=" --clip-grad 1.0"
OPTS+=" --epochs 10"
# length
OPTS+=" --max-length ${MAX_LENGTH}"
OPTS+=" --max-prompt-length 256"
# runtime
OPTS+=" --do-train"
OPTS+=" --do-valid"
OPTS+=" --eval-gen"
OPTS+=" --save-interval -1"
OPTS+=" --eval-interval -1"
OPTS+=" --log-interval 4"
OPTS+=" --mid-log-num 1"
OPTS+=" --save ${SAVE_PATH}"
# seed
OPTS+=" --seed ${SEED}"
OPTS+=" --seed-order ${SEED_ORDER}"
# deepspeed
OPTS+=" --deepspeed"
OPTS+=" --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config_zero2_fp16.json"
# type
OPTS+=" --type lm"
# gen
OPTS+=" --do-sample"
OPTS+=" --top-k 0"
OPTS+=" --top-p 1.0"
OPTS+=" --temperature 1.0"

export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=3
export PYTHONPATH=${BASE_PATH}
CMD="torchrun ${DISTRIBUTED_ARGS} ${BASE_PATH}/finetune.py ${OPTS} $@"

echo ${CMD}
echo "PYTHONPATH=${PYTHONPATH}"
mkdir -p ${SAVE_PATH}
${CMD}

majidurrahman1437 avatar Mar 16 '25 21:03 majidurrahman1437

I successfully ran this script locally. Did you install our modified transformers library, which implements model parallelism?

pip3 install git+https://github.com/t1101675/transformers@minillm
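
If it is installed, a quick way to check which transformers build the environment actually imports:

import transformers

# Both attributes are standard; the printed path should point at the
# environment where the minillm fork was installed.
print(transformers.__version__)
print(transformers.__file__)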

t1101675 avatar Mar 16 '25 23:03 t1101675

Yes. I tried running in a clean environment, but I am still getting memory errors like the following:

"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU"

majidurrahman1437 avatar Mar 17 '25 19:03 majidurrahman1437

This is because your GPU does not have enough memory. You can try a larger model parallel size (like 4); note that the checkpoint then needs to be converted again with the matching --target_mp_size.

t1101675 avatar Mar 17 '25 23:03 t1101675

I tried using MP_SIZE = 4, but I'm still getting the "out of memory" error. Do you have any suggestions in this regard?

majidurrahman1437 avatar Mar 18 '25 14:03 majidurrahman1437

Interestingly, I successfully ran the script for "LLaMA2-7B" with MP_SIZE = 4.

majidurrahman1437 avatar Mar 19 '25 13:03 majidurrahman1437

It is expected that the 7B model has lower memory usage. To reduce the memory usage of the 13B model, you can (see the rough estimate after this list):

  • Decrease the BATCH_SIZE and EVAL_BATCH_SIZE hyperparameters.
  • Use a larger model parallel size, like MP_SIZE = 8.
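
As a very rough sketch of why a larger MP_SIZE helps (this counts fp16 weights only; activations, gradients, and the ZeRO-2-partitioned optimizer states come on top of it):

# Back-of-envelope: fp16 weight memory per GPU for a 13B-parameter model.
params = 13e9
bytes_per_param = 2  # fp16

for mp_size in (1, 2, 4, 8):
    per_gpu_gb = params * bytes_per_param / mp_size / 1024**3
    print(f"MP_SIZE={mp_size}: ~{per_gpu_gb:.1f} GB of weights per GPU")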

t1101675 avatar Mar 20 '25 13:03 t1101675