LLaVA-NeXT
High Memory Usage in DataLoader Workers Leading to Out-of-Memory (OOM)
I'm experiencing high memory usage in the DataLoader workers when using a custom dataset class to lazily load a large dataset, which eventually leads to Out-of-Memory (OOM) errors during training. MaxRSS (maximum resident set size) steadily increases over the course of training, which suggests a memory leak or improper memory management in the DataLoader or the dataset preprocessing.
Error Message Example:
RuntimeError: DataLoader worker (pid XXXX) is killed by signal: Killed
Setup:
- Distributed training with 3 nodes, 4 GPUs per node
- Memory: 512 GB RAM
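To pin down which processes are growing, one option is to wrap the dataset so every DataLoader worker periodically reports its own resident set size. This is only a sketch, not code from the repo: RSSLoggingDataset and log_every are illustrative names, wrapping the repo's lazy dataset class this way is an assumption about where the hook fits, and the only extra dependency is psutil.

import psutil
from torch.utils.data import Dataset, get_worker_info


class RSSLoggingDataset(Dataset):
    """Delegates to an existing dataset and logs the current process's RSS."""

    def __init__(self, inner, log_every=500):
        self.inner = inner
        self.log_every = log_every
        self._seen = 0  # each forked worker gets its own copy of this counter

    def __len__(self):
        return len(self.inner)

    def __getitem__(self, idx):
        self._seen += 1
        if self._seen % self.log_every == 0:
            info = get_worker_info()
            worker_id = info.id if info is not None else "main"
            rss_gib = psutil.Process().memory_info().rss / 1024 ** 3  # current pid
            print(f"[worker {worker_id}] {self._seen} samples -> RSS {rss_gib:.2f} GiB",
                  flush=True)
        return self.inner[idx]

Wrapping the training dataset (e.g. train_dataset = RSSLoggingDataset(train_dataset)) before it reaches the trainer shows whether every worker grows at roughly the same rate, which would point at the dataset/collator path rather than a single misbehaving rank.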
Training Configuration
Here is the relevant training configuration:
#!/bin/bash
source use_env.sh
NNODES=$SLURM_NNODES
WORLD_SIZE=$((NNODES * NUM_GPUS))
NODE_RANK=$SLURM_NODEID
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=12802
ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
llava/train/train_mem.py \
--deepspeed scripts/zero3.json \
--model_name_or_path ${CKPT_PATH} \
--version ${PROMPT_VERSION} \
--data_path ./playground/data/llava_v1_5_mix665k_no-ocr.json \
--image_folder ./playground/data \
--pretrain_mm_mlp_adapter="./checkpoints/projectors/${BASE_RUN_NAME}/checkpoint-5/mm_projector.bin" \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres \
--image_grid_pinpoints "[(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]" \
--mm_patch_merge_type spatial_unpad \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir "./checkpoints/${MID_RUN_NAME}" \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataloader_num_workers 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True
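With --dataloader_num_workers 4 and 4 ranks per node, each node runs 4 x 4 = 16 DataLoader worker processes on top of the 4 training ranks, and each worker is a fork that starts with a copy-on-write view of its rank's lazily loaded annotation list. A quick node-level sanity check against the 512 GB budget is to sum resident memory over the job's Python processes. This is a rough sketch, assuming all relevant processes on the node are Python processes belonging to this job:

import psutil

def node_python_rss_gib():
    """Sum resident memory across Python processes visible on this node."""
    total = 0
    for p in psutil.process_iter(["name", "memory_info"]):
        name = (p.info.get("name") or "").lower()
        mem = p.info.get("memory_info")
        if "python" in name and mem is not None:
            total += mem.rss
    return total / 1024 ** 3

if __name__ == "__main__":
    print(f"Python RSS on this node: {node_python_rss_gib():.1f} GiB")

Comparing this total over time against the per-worker numbers makes it easier to tell whether growth is spread evenly across the 16 workers or concentrated in the training ranks.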
- Could the lazy preprocessing in the dataset be failing to release memory properly? (The sketch below shows the copy-on-write pattern I suspect.)
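For context, the pattern I suspect (an assumption on my part, not something verified against this repo) is the well-known copy-on-write behaviour of forked DataLoader workers: if the lazily loaded annotations live in a plain Python list of dicts, every access updates reference counts, so memory pages that start out shared are gradually copied into each of the 16 workers per node. That shows up as steadily climbing RSS even though nothing is leaking in the usual sense. A sketch of the usual workaround, with illustrative names (PackedJsonDataset) and the image loading/tokenization elided, is to pack the records into flat numpy buffers that workers can share read-only:

import json
import numpy as np
from torch.utils.data import Dataset


class PackedJsonDataset(Dataset):
    """Keeps the annotation records as one contiguous byte buffer plus an offsets
    array, so forked DataLoader workers share the pages read-only instead of slowly
    privatising a Python list of dicts through refcount updates."""

    def __init__(self, json_path):
        with open(json_path, "r") as f:
            records = json.load(f)  # e.g. the mix665k-style list of dicts
        blobs = [json.dumps(r).encode("utf-8") for r in records]
        self.offsets = np.cumsum([0] + [len(b) for b in blobs])
        self.buffer = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = self.offsets[idx], self.offsets[idx + 1]
        record = json.loads(bytes(self.buffer[start:end]).decode("utf-8"))
        # ...image loading, anyres tiling and tokenization would follow here,
        # as in the actual lazy dataset...
        return record

If the RSS curve flattens after a change like this, the growth was the per-worker copy of the annotation list rather than a true leak; if it keeps climbing, the image decoding and tokenization path would be the next suspect.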