InternVL

[Bug] torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Open zhouyang2002 opened this issue 10 months ago • 5 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am running fine-tuning experiments and hit what looks like an environment issue. I tried Python 3.9 and 3.11, and set up the conda environment from requirements/internvl_chat.txt as described in the official Installation section. Running the command below fails with the following error.

Reproduction

GPUS=4 PER_DEVICE_BATCH_SIZE=1 sh shell/internvl2.5/2nd_finetune/internvl2_5_1b_dynamic_res_2nd_finetune_full.sh

Environment

### transformers == 4.37.2

### flash-attn == 2.3.6

### conda list

_libgcc_mutex             0.1                        main    defaults
_openmp_mutex             5.1                       1_gnu    defaults
accelerate                0.34.2                   pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
bitsandbytes              0.42.0                   pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6    defaults
ca-certificates           2024.12.31           h06a4308_0    defaults
certifi                   2025.1.31                pypi_0    pypi
charset-normalizer        3.4.1                    pypi_0    pypi
contourpy                 1.3.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
decord                    0.6.0                    pypi_0    pypi
deepspeed                 0.16.3                   pypi_0    pypi
einops                    0.6.1                    pypi_0    pypi
einops-exts               0.0.4                    pypi_0    pypi
filelock                  3.17.0                   pypi_0    pypi
flash-attn                2.3.6                    pypi_0    pypi
fonttools                 4.56.0                   pypi_0    pypi
fsspec                    2025.2.0                 pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
huggingface-hub           0.29.1                   pypi_0    pypi
idna                      3.10                     pypi_0    pypi
imageio                   2.37.0                   pypi_0    pypi
jinja2                    3.1.5                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
kiwisolver                1.4.8                    pypi_0    pypi
ld_impl_linux-64          2.40                 h12ee557_0    defaults
libffi                    3.4.4                h6a678d5_1    defaults
libgcc-ng                 11.2.0               h1234567_1    defaults
libgomp                   11.2.0               h1234567_1    defaults
libstdcxx-ng              11.2.0               h1234567_1    defaults
libuuid                   1.41.5               h5eee18b_0    defaults
markupsafe                3.0.2                    pypi_0    pypi
matplotlib                3.10.0                   pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
msgpack                   1.1.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    defaults
networkx                  3.4.2                    pypi_0    pypi
ninja                     1.11.1.3                 pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
opencv-python             4.11.0.86                pypi_0    pypi
openssl                   3.0.15               h5eee18b_0    defaults
orjson                    3.10.15                  pypi_0    pypi
packaging                 24.2                     pypi_0    pypi
peft                      0.10.0                   pypi_0    pypi
pillow                    11.1.0                   pypi_0    pypi
pip                       25.0                     pypi_0    pypi
protobuf                  5.29.3                   pypi_0    pypi
psutil                    7.0.0                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pycocoevalcap             1.2                      pypi_0    pypi
pycocotools               2.0.8                    pypi_0    pypi
pydantic                  2.10.6                   pypi_0    pypi
pydantic-core             2.27.2                   pypi_0    pypi
pyparsing                 3.2.1                    pypi_0    pypi
python                    3.11.11              he870216_0    defaults
python-dateutil           2.9.0.post0              pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    defaults
regex                     2024.11.6                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
safetensors               0.5.2                    pypi_0    pypi
scikit-learn              1.6.1                    pypi_0    pypi
scipy                     1.15.2                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                75.8.0                   pypi_0    pypi
shortuuid                 1.0.13                   pypi_0    pypi
six                       1.17.0                   pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    defaults
sympy                     1.13.1                   pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
termcolor                 2.5.0                    pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
timm                      0.9.12                   pypi_0    pypi
tk                        8.6.14               h39e8969_0    defaults
tokenizers                0.15.1                   pypi_0    pypi
torch                     2.6.0                    pypi_0    pypi
torchvision               0.21.0                   pypi_0    pypi
tqdm                      4.67.1                   pypi_0    pypi
transformers              4.37.2                   pypi_0    pypi
triton                    3.2.0                    pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2025a                h04d1e81_0    defaults
urllib3                   2.3.0                    pypi_0    pypi
wheel                     0.45.1                   pypi_0    pypi
xz                        5.6.4                h5eee18b_1    defaults
yacs                      0.1.8                    pypi_0    pypi
zlib                      1.2.13               h5eee18b_1    defaults
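
The failure reported in this issue is a segmentation fault, which comes from native code rather than from a Python exception. The thread does not establish which package is responsible. One way to narrow it down is to import each package that ships compiled extensions in a fresh interpreter and look at the exit code. The sketch below assumes that approach; the helper name `probe_import` and the choice of modules to probe are illustrative, not part of InternVL.

```python
import subprocess
import sys


def probe_import(module: str, timeout: float = 120.0) -> int:
    """Import `module` in a fresh interpreter and return its exit code.

    0 means the import worked, a positive code usually means a Python
    exception (for example ImportError), and a negative code means the
    interpreter was killed by a signal (for example -11 for SIGSEGV).
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode


if __name__ == "__main__":
    # Names chosen from the package list above; flash-attn is imported
    # as flash_attn. Which of these (if any) crashes is an open question.
    for name in ["torch", "flash_attn", "deepspeed", "transformers"]:
        print(f"{name}: exit code {probe_import(name)}")
```

Running each import in its own process keeps a crash in one package from taking down the check itself, so every package gets a result.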

Error traceback

+ GPUS=4
+ BATCH_SIZE=128
+ PER_DEVICE_BATCH_SIZE=1
+ GRADIENT_ACC=32
+ pwd
+ export PYTHONPATH=:/home/disk/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full
+ [ ! -d work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full ]
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=4 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path /home/disk/InternVL/pretrained/InternVL2_5-1B --conv_style internvl2_5 --use_fast_tokenizer False --output_dir work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full --meta_path /home/disk/InternVL/internvl_chat/shell/data/train.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.1 --freeze_llm False --freeze_mlp False --freeze_backbone True --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 32 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 8192 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage3_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_full/training_log.txt
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] 
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] *****************************************
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0220 19:13:17.106000 434973 site-packages/torch/distributed/run.py:792] *****************************************
[2025-02-20 19:13:19,847] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,854] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,886] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-20 19:13:19,930] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0220 19:13:20.625000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 435063 closing signal SIGTERM
W0220 19:13:20.626000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 435066 closing signal SIGTERM
E0220 19:13:20.740000 434973 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 1 (pid: 435064) of binary: /opt/conda/envs/123/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/123/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/123/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
internvl/train/internvl_chat_finetune.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2025-02-20_19:13:20
  host      : bdf886c8e7f2
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 435065)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 435065
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-20_19:13:20
  host      : bdf886c8e7f2
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 435064)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 435064
========================================================
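
The `exitcode: -11` in the report above follows the usual convention that a negative exit code is the number of the signal that terminated the child process. A minimal sketch for decoding such codes, using only the standard library (the helper name `describe_exitcode` is hypothetical):

```python
import signal


def describe_exitcode(code: int) -> str:
    """Turn a child exit code into a readable description.

    Negative codes follow the subprocess convention: -N means the
    process was terminated by signal N.
    """
    if code == 0:
        return "exited normally"
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exitcode(-11))
```

Because the crash happens in native code, `error_file` and `traceback` carry no Python stack. Setting the environment variable `PYTHONFAULTHANDLER=1` before rerunning the script enables the standard `faulthandler` module in each worker, which dumps the Python stack when SIGSEGV arrives and may show which import or call was running at the time.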

zhouyang2002 • Feb 20 '25 11:02

same

Ziyu-Jin • Mar 06 '25 06:03

@zhouyang2002 @Ziyu-Jin Were you able to resolve this?

ZenithWisp • Mar 19 '25 11:03

> @zhouyang2002 @Ziyu-Jin Were you able to resolve this?

I think it might be a problem with my conda env. After I reinstalled the environment, it no longer seems to report errors.

Ziyu-Jin • Mar 19 '25 11:03

@Ziyu-Jin Can you please share your conda list?
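
If a working `conda list` is shared, comparing it with the failing one from the issue would show which package versions differ. A small sketch of that comparison, assuming the whitespace-separated `name version build channel` layout used in the listing above; the function names are hypothetical and the "working" versions in the demo are placeholders, not values from this thread:

```python
from typing import Dict, List, Optional, Tuple


def parse_conda_list(text: str) -> Dict[str, str]:
    """Map package name -> version from `conda list` output."""
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# ..." header lines conda prints.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def diff_envs(
    failing: Dict[str, str], working: Dict[str, str]
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (name, failing_version, working_version) for each difference."""
    rows = []
    for name in sorted(set(failing) | set(working)):
        a, b = failing.get(name), working.get(name)
        if a != b:
            rows.append((name, a, b))
    return rows


if __name__ == "__main__":
    failing = parse_conda_list("""
torch                     2.6.0                    pypi_0    pypi
flash-attn                2.3.6                    pypi_0    pypi
""")
    # Placeholder versions for illustration only; not from the thread.
    working = parse_conda_list("""
torch                     0.0.0                    pypi_0    pypi
flash-attn                2.3.6                    pypi_0    pypi
""")
    for name, a, b in diff_envs(failing, working):
        print(f"{name}: {a} -> {b}")
```

A package missing from one environment shows up with `None` on that side, so added and removed packages are reported alongside version changes.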

ZenithWisp • Mar 19 '25 11:03

@zhouyang2002 @ZenithWisp Have you solved this problem?

LuyangJ • Apr 13 '25 09:04