[Bug] torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I am running fine-tuning experiments but ran into an environment issue. I tried both Python 3.9 and Python 3.11 and set up the conda environment from requirements/internvl_chat.txt, as described in the official Installation section. When I launch the fine-tuning script, I get the error below.
Reproduction
GPUS=4 PER_DEVICE_BATCH_SIZE=1 sh shell/internvl2.5/2nd_finetune/internvl2_5_1b_dynamic_res_2nd_finetune_full.sh
Environment
### transformers == 4.37.2
### flash-attn == 2.3.6
### conda list
_libgcc_mutex 0.1 main defaults
_openmp_mutex 5.1 1_gnu defaults
accelerate 0.34.2 pypi_0 pypi
annotated-types 0.7.0 pypi_0 pypi
bitsandbytes 0.42.0 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6 defaults
ca-certificates 2024.12.31 h06a4308_0 defaults
certifi 2025.1.31 pypi_0 pypi
charset-normalizer 3.4.1 pypi_0 pypi
contourpy 1.3.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
decord 0.6.0 pypi_0 pypi
deepspeed 0.16.3 pypi_0 pypi
einops 0.6.1 pypi_0 pypi
einops-exts 0.0.4 pypi_0 pypi
filelock 3.17.0 pypi_0 pypi
flash-attn 2.3.6 pypi_0 pypi
fonttools 4.56.0 pypi_0 pypi
fsspec 2025.2.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
huggingface-hub 0.29.1 pypi_0 pypi
idna 3.10 pypi_0 pypi
imageio 2.37.0 pypi_0 pypi
jinja2 3.1.5 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
kiwisolver 1.4.8 pypi_0 pypi
ld_impl_linux-64 2.40 h12ee557_0 defaults
libffi 3.4.4 h6a678d5_1 defaults
libgcc-ng 11.2.0 h1234567_1 defaults
libgomp 11.2.0 h1234567_1 defaults
libstdcxx-ng 11.2.0 h1234567_1 defaults
libuuid 1.41.5 h5eee18b_0 defaults
markupsafe 3.0.2 pypi_0 pypi
matplotlib 3.10.0 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.1.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0 defaults
networkx 3.4.2 pypi_0 pypi
ninja 1.11.1.3 pypi_0 pypi
numpy 1.26.4 pypi_0 pypi
nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
nvidia-nccl-cu12 2.21.5 pypi_0 pypi
nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
opencv-python 4.11.0.86 pypi_0 pypi
openssl 3.0.15 h5eee18b_0 defaults
orjson 3.10.15 pypi_0 pypi
packaging 24.2 pypi_0 pypi
peft 0.10.0 pypi_0 pypi
pillow 11.1.0 pypi_0 pypi
pip 25.0 pypi_0 pypi
protobuf 5.29.3 pypi_0 pypi
psutil 7.0.0 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pycocoevalcap 1.2 pypi_0 pypi
pycocotools 2.0.8 pypi_0 pypi
pydantic 2.10.6 pypi_0 pypi
pydantic-core 2.27.2 pypi_0 pypi
pyparsing 3.2.1 pypi_0 pypi
python 3.11.11 he870216_0 defaults
python-dateutil 2.9.0.post0 pypi_0 pypi
pyyaml 6.0.2 pypi_0 pypi
readline 8.2 h5eee18b_0 defaults
regex 2024.11.6 pypi_0 pypi
requests 2.32.3 pypi_0 pypi
safetensors 0.5.2 pypi_0 pypi
scikit-learn 1.6.1 pypi_0 pypi
scipy 1.15.2 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 75.8.0 pypi_0 pypi
shortuuid 1.0.13 pypi_0 pypi
six 1.17.0 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0 defaults
sympy 1.13.1 pypi_0 pypi
tensorboardx 2.6.2.2 pypi_0 pypi
termcolor 2.5.0 pypi_0 pypi
threadpoolctl 3.5.0 pypi_0 pypi
timm 0.9.12 pypi_0 pypi
tk 8.6.14 h39e8969_0 defaults
tokenizers 0.15.1 pypi_0 pypi
torch 2.6.0 pypi_0 pypi
torchvision 0.21.0 pypi_0 pypi
tqdm 4.67.1 pypi_0 pypi
transformers 4.37.2 pypi_0 pypi
triton 3.2.0 pypi_0 pypi
typing-extensions 4.12.2 pypi_0 pypi
tzdata 2025a h04d1e81_0 defaults
urllib3 2.3.0 pypi_0 pypi
wheel 0.45.1 pypi_0 pypi
xz 5.6.4 h5eee18b_1 defaults
yacs 0.1.8 pypi_0 pypi
zlib 1.2.13 h5eee18b_1 defaults
Error traceback
+ GPUS=4
+ BATCH_SIZE=128
+ PER_DEVICE_BATCH_SIZE=1
+ GRADIENT_ACC=32
+ pwd
+ export PYTHONPATH=:/home/disk/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full
+ [ ! -d work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full ]
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=4 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path /home/disk/InternVL/pretrained/InternVL2_5-1B --conv_style internvl2_5 --use_fast_tokenizer False --output_dir work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full --meta_path /home/disk/InternVL/internvl_chat/shell/data/train.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.1 --freeze_llm False --freeze_mlp False --freeze_backbone True --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 32 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 8192 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage3_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full/training_log.txt
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792]
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] *****************************************
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] *****************************************
[2025-02-20 19:13:19,847] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,854] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,886] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,930] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0220 19:13:20.625000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 435063 closing signal SIGTERM
W0220 19:13:20.626000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 435066 closing signal SIGTERM
E0220 19:13:20.740000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 1 (pid: 435064) of binary: /opt/conda/envs/123/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/123/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
internvl/train/internvl_chat_finetune.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2025-02-20_19:13:20
host : bdf886c8e7f2
rank : 2 (local_rank: 2)
exitcode : -11 (pid: 435065)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 435065
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-20_19:13:20
host : bdf886c8e7f2
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 435064)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 435064
========================================================
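Since all four workers receive SIGSEGV before any training output appears, the crash may already happen during the early imports rather than in the training loop. A quick way to narrow this down is to import each heavy dependency in isolation, in the same conda env and outside torchrun. This is only a debugging sketch; the suspicion that the flash-attn 2.3.6 build does not match torch 2.6.0 is an assumption, not something the traceback confirms:

```bash
# Whichever of these commands segfaults is the likely culprit.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import flash_attn; print(flash_attn.__version__)"   # crash here suggests a flash-attn / torch ABI mismatch
python -c "import deepspeed; print(deepspeed.__version__)"     # crash here suggests the deepspeed build
python -c "import torch; x = torch.randn(8, 8, device='cuda'); print((x @ x).shape)"  # basic CUDA sanity check
```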
same
@zhouyang2002 @Ziyu-Jin Were you able to resolve this?
I think it might be a problem with my conda env. After I reinstalled the environment, the error no longer appears.
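For reference, one reinstall step that can matter in this situation (an assumption on my side, not confirmed as the actual fix here) is rebuilding flash-attn from source against the torch version that is actually installed, instead of relying on a prebuilt wheel:

```bash
# Rebuild flash-attn against the installed torch (version pin taken from the environment above).
pip uninstall -y flash-attn
pip install flash-attn==2.3.6 --no-build-isolation
```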
@Ziyu-Jin Can you please share your conda list?
@zhouyang2002 @ZenithWisp Have you solved this problem?