
Segmentation fault (SIGSEGV, exit code -11) while running internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh

NmTamil2 opened this issue on Mar 18, 2025 · 1 comment

Hello,

I have 2 x A100 GPUs with a total of 160 GB of VRAM. I attempted to fine-tune the OpenGVLab/InternVL-Chat-V1-5 model by following the steps in the InternVL documentation, but when I ran the fine-tuning shell script I encountered the following error.

(ft-env) root@07c329c33692:/workspace/ft_InternVL/InternVL/internvl_chat# GPUS=2 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl1.5/2nd_finetune/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh
+ GPUS=2
+ BATCH_SIZE=16
+ PER_DEVICE_BATCH_SIZE=2
+ GRADIENT_ACC=4
+ pwd
+ export PYTHONPATH=:/workspace/ft_InternVL/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ [ ! -d work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora ]
+ mkdir -p work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=2 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path /workspace/huggingface_cache/hub/models--OpenGVLab--InternVL-Chat-V1-5 --conv_style internlm2-chat --output_dir work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora --meta_path ./shell/data/internvl_1_2_finetune_custom.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 12 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0.05 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage3_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v1_5/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora/training_log.txt
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] 
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0318 06:16:30.522000 3264 torch/distributed/run.py:792] *****************************************
[2025-03-18 06:16:36,811] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-18 06:16:37,037] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0318 06:16:37.379000 3264 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3407 closing signal SIGTERM
E0318 06:16:37.598000 3264 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 3406) of binary: /workspace/ft_InternVL/ft-env/bin/python
Traceback (most recent call last):
  File "/workspace/ft_InternVL/ft-env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-18_06:16:37
  host      : 07c329c33692
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 3406)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 3406
======================================================
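
The segfault hits rank 0 almost immediately, right after DeepSpeed's accelerator auto-detection and before any training output, so it may be one of the compiled extensions crashing on import rather than anything in the fine-tuning script itself. As a rough check on my side (just a guess, assuming torch, deepspeed and flash-attn are the relevant compiled packages in this environment), I can import them one by one in the same virtualenv to see which one segfaults:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import deepspeed; print(deepspeed.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"

If any of these crashes on its own, the problem would be a binary/CUDA mismatch in the environment rather than the training script.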

NmTamil2 · Mar 18, 2025

@czczup @Weiyun1025 I am having the same problem. Could you please help fix this?

ZenithWisp · Mar 21, 2025