OOM while finetuning Starcoder
I really appreciate you releasing this work. I have been trying to do something similar with the original StarCoder finetuning code, but ran into a variety of issues there. Unfortunately, when I run this script on my own dataset (only around 6,800 MOO verbs), I hit a fairly rapid OOM on a machine with 8x A100 80GB cards. At first I thought it was because I had increased max_seq_size (I was hoping for 1024 tokens), but dropping it back to 512 gave the same error. I then tried reducing the batch size to 1, but that also errored out with insufficient memory. The only other thing I changed is the prompt, and only minimally: I swapped the language name for my own and picked different columns out of my dataset.
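For context, here is the back-of-envelope math I did while debugging; the byte counts are my own assumptions, not measurements. With mixed-precision Adam, weights plus gradients plus fp32 optimizer state typically come to roughly 16-18 bytes per parameter, and FSDP full_shard splits that across ranks. For the ~15.5B-parameter StarCoder on 8 cards that is only about 32 GiB per GPU, which should fit comfortably in 80GB, so my current guess is that the OOM comes from activations or from a per-rank peak while the model is loaded and wrapped, not from the sharded steady state:

# Rough sharded-memory estimate (assumptions: ~15.5B params and ~18
# bytes/param for bf16 weights, gradients, and fp32 Adam m/v/master copy).
params=15500000000
bytes_per_param=18
gpus=8
total_gib=$(( params * bytes_per_param / 1024 / 1024 / 1024 ))
per_gpu_gib=$(( total_gib / gpus ))
echo "Sharded weights+grads+optimizer: ~${total_gib} GiB total, ~${per_gpu_gib} GiB per GPU"
# Activations, temporary buffers, and any unsharded load-time peak are on top of this.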
Here is my run.sh:
#! /usr/bin/env bash
set -e # stop on first error
set -u # stop if any variable is unbound
set -o pipefail # stop if any command in a pipe fails
LOG_FILE="output.log"
export TRANSFORMERS_VERBOSITY=info # export, or the torchrun children never see it
get_gpu_count() {
    local gpu_count
    gpu_count=$(nvidia-smi -L | wc -l)
    echo "$gpu_count"
}
gpu_count=$(get_gpu_count)
echo "Number of GPUs: $gpu_count"
train() {
    # Guard first, so set -u/-e don't trip on missing arguments.
    if [ "$#" -lt 2 ]; then
        echo "Error: Missing arguments. Please provide the script and its arguments."
        return 1
    fi
    local script="$1"
    shift 1
    # Pass the remaining arguments through as "$@" so quoting survives;
    # flattening them into one string would word-split any argument with spaces.
    { torchrun --nproc_per_node="$gpu_count" "$script" "$@" 2>&1; } | tee -a "$LOG_FILE"
}
train train.py \
    --model_name_or_path "bigcode/starcoder" \
    --data_path ./verbs_augmented/verbs_augmented.jsonl \
    --bf16 True \
    --output_dir moocoder \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --learning_rate 2e-5 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard" \
    --fsdp_transformer_layer_cls_to_wrap 'GPTBigCodeBlock' \
    --tf32 True
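In case it helps the discussion, here is the variant of the launch command I was planning to try next. Since per_device_train_batch_size 1 alone still OOMed, the main new ingredient is gradient checkpointing, which trades recompute for activation memory; --gradient_checkpointing is a standard HF TrainingArguments flag and "full_shard auto_wrap" is a standard value for the Trainer's fsdp option, but I have not verified that this repo's train.py accepts them, so treat this as a sketch:

# Unverified variant: smaller per-device batch, more accumulation to keep the
# effective batch size, plus gradient checkpointing to shrink activation memory.
train train.py \
    --model_name_or_path "bigcode/starcoder" \
    --data_path ./verbs_augmented/verbs_augmented.jsonl \
    --bf16 True \
    --output_dir moocoder \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing True \
    --learning_rate 2e-5 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'GPTBigCodeBlock' \
    --tf32 True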
Any idea what might be going wrong here? Is there any more info I can give you to help figure this out?
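For example, I could log per-GPU memory alongside the run and attach the output, so the point of failure is visible:

# Sample per-GPU memory once a second in the background for the whole run.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total \
    --format=csv -l 1 > gpu_mem.log &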