Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

The error occurs at this step:
[2025-01-12 17:52:25,277] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:33,530] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:41,724] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:49,987] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:58,392] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 2020

Reproduction

set -x

GPUS=${GPUS:-8}
BATCH_SIZE=${BATCH_SIZE:-64}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-8}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch

OUTPUT_DIR='work_dirs/internvl_chat_v2_5/internvl2_5_8b_dynamic_res_2nd_finetune_lora_0112'

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

# number of gpus: 2
# batch size per gpu: 4
# gradient accumulation steps: 2
# total batch size: 16
# epoch: 1

torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=127.0.0.1 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/245_disk/mb_train_sft_1121/InternVL2_5-8B-MPO" \
  --conv_style "internvl2_5" \
  --use_fast_tokenizer False \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/mb_train_sft_250110.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp False \
  --freeze_backbone True \
  --use_llm_lora 8 \
  --vision_select_layer -1 \
  --dataloader_num_workers 16 \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-5 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 8192 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"

Environment

absl-py                   2.0.0
accelerate                0.33.0
adabench                  1.2.64
aii-pypai                 0.1.40.63
aiofiles                  23.2.1
aiohttp                   3.9.1
aiosignal                 1.3.1
aistudio-analyzer         0.0.4.102
aistudio-checkpoint       0.1.241220
aistudio-common           0.0.28.55
aistudio-notebook         2.0.128
aistudio-serving          0.0.0.62
alipay-pcache             0.1.6
aliyun-python-sdk-core    2.14.0
aliyun-python-sdk-kms     2.16.2
altair                    5.2.0
annotated-types           0.6.0
ant-couler                0.0.1rc17
anyio                     4.2.0
apex                      0.1
archspec                  0.2.1
argo-workflows            3.5.1
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
astroid                   3.0.2
asttokens                 2.4.1
async-timeout             4.0.3
atorch                    1.2.1
attrs                     23.1.0
autopep8                  2.0.4
Babel                     2.15.0
backcall                  0.2.0
beautifulsoup4            4.12.2
bigmodelvis               0.0.1
bitarray                  2.8.5
bitsandbytes              0.42.0
bleach                    6.1.0
blinker                   1.7.0
boltons                   23.0.0
boto3                     1.34.2
botocore                  1.34.2
Brotli                    1.0.9
cachetools                3.1.1
cattrs                    23.2.3
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        2.0.4
cheroot                   10.0.0
click                     6.7
click-config-file         0.6.0
cloudpickle               3.0.0
colorama                  0.4.6
comm                      0.2.1
conda                     23.11.0
conda-content-trust       0.2.0
conda-libmamba-solver     23.12.0
conda-package-handling    2.2.0
conda_package_streaming   0.9.0
configobj                 5.0.8
configparser              6.0.0
contourpy                 1.1.1
couler-core               0.1.1rc11
crcmod                    1.7
cryptography              41.0.7
cycler                    0.12.1
Cython                    3.0.6
datasets                  2.15.0
debugpy                   1.8.0
decorator                 5.1.1
decord                    0.6.0
deepspeed                 0.15.4
defusedxml                0.7.1
delta-center-client       0.0.4
Deprecated                1.2.14
deprecation               2.1.0
dill                      0.3.7
distlib                   0.3.8
distro                    1.8.0
dlrover                   0.3.6
docker                    4.1.0
docstring-to-markdown     0.13
easydl-sdk                0.0.6
einops                    0.7.0
entrypoints               0.4
evaluate                  0.4.0
exceptiongroup            1.2.0
executing                 2.0.1
fairscale                 0.4.1
fastapi                   0.108.0
fastjsonschema            2.19.1
fastmoe                   1.0.0
fasttext                  0.9.2
fe                        0.3.33
ffmpy                     0.3.1
filelock                  3.13.1
flake8                    6.1.0
flash-attn                2.0.4
flash-attn-1              0.2.6.post2
Flask                     3.0.0
fonttools                 4.46.0
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2023.10.0
ftfy                      6.1.3
gitdb                     4.0.11
GitPython                 3.1.40
google-auth               2.25.2
google-auth-oauthlib      0.4.6
gradio                    4.13.0
gradio_client             0.8.0
grpcio                    1.34.1
grpcio-tools              1.34.1
h11                       0.14.0
hjson                     3.1.0
httpcore                  1.0.2
httpx                     0.26.0
huggingface-hub           0.27.0
icetk                     0.0.7
idna                      3.4
imageio                   2.35.1
importlib-metadata        7.0.0
importlib-resources       6.1.1
iniconfig                 2.0.0
ipykernel                 6.29.5
ipython                   8.12.3
ipython-genutils          0.2.0
isodate                   0.6.1
isoduration               20.11.0
isort                     5.13.2
itsdangerous              2.1.2
jaraco.functools          4.0.0
jedi                      0.19.1
jedi-language-server      0.41.2
Jinja2                    2.11.3
jinjasql                  0.1.8
jmespath                  0.10.0
joblib                    1.3.2
json5                     0.9.25
jsonpatch                 1.32
jsonpath-ng               1.6.0
jsonpointer               2.1
jsonschema                4.20.0
jsonschema-specifications 2023.11.2
jupyter_client            8.6.0
jupyter_core              5.7.1
jupyter-events            0.9.0
jupyter-lsp               2.2.5
jupyter_server            2.14.2
jupyter_server_terminals  0.5.1
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
kiwisolver                1.4.5
kmitool                   0.0.9
kubemaker                 0.2.17
kubernetes                9.0.0
langdetect                1.0.9
libmambapy                1.5.3
libro                     0.1.11
loralib                   0.1.1
lsh                       0.1.2
lsprotocol                2023.0.0
lxml                      4.9.3
M2Crypto                  0.38.0
Markdown                  3.5.1
markdown-it-py            3.0.0
MarkupSafe                1.1.1
marshmallow               3.20.1
matplotlib                3.7.4
matplotlib-inline         0.1.6
mccabe                    0.7.0
mdurl                     0.1.2
megatron.core             0.1
menuinst                  2.0.1
mistune                   0.8.4
mock                      5.1.0
more-itertools            10.1.0
mpi4py                    3.1.5
mpmath                    1.3.0
msgpack                   1.0.7
multidict                 6.0.4
multiprocess              0.70.15
nbclient                  0.5.13
nbconvert                 6.4.4
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.0
ninja                     1.11.1.1
nltk                      3.8.1
notebook                  6.4.6
numpy                     1.23.5
nvidia-cublas-cu12        12.4.5.8
nvidia-ml-py              12.560.30
oauthlib                  3.2.2
odps                      3.5.1
opencv-python             4.10.0.84
opendelta                 0.3.2
orjson                    3.9.10
oss2                      2.6.0
osscmd                    0.4.5
overrides                 7.7.0
packaging                 23.1
pandas                    1.0.0
pandocfilters             1.5.0
parameterized             0.9.0
parso                     0.8.3
pathos                    0.3.0
peft                      0.3.0
peppercorn                0.6
pexpect                   4.9.0
pickleshare               0.7.5
Pillow                    9.3.0
pip                       23.3.1
pkgutil_resolve_name      1.3.10
platformdirs              3.10.0
pluggy                    1.0.0
ply                       3.11
pox                       0.3.3
ppft                      1.7.6.7
prettytable               3.9.0
prometheus-client         0.19.0
prompt-toolkit            3.0.43
protobuf                  3.20.0
psutil                    5.9.6
PTable                    0.9.2
ptyprocess                0.7.0
pure-eval                 0.2.2
py                        1.11.0
py-cpuinfo                9.0.0
py-spy                    0.3.14
pyaml                     21.10.1
pyarrow                   12.0.0
pyarrow-hotfix            0.6
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pybind11                  2.11.1
pycodestyle               2.11.1
pycosat                   0.6.6
pycparser                 2.21
pycryptodome              3.19.0
pydantic                  2.10.4
pydantic_core             2.27.2
pyDes                     2.0.1
pydocstyle                6.3.0
pydub                     0.25.1
pyflakes                  3.1.0
pygls                     1.2.1
Pygments                  2.17.2
pyhocon                   0.3.60
pyinotify                 0.9.6
pylint                    3.0.3
pynvml                    11.4.1
pyodps                    0.11.4.1
Pyomo                     6.7.0
pyOpenSSL                 23.2.0
pyparsing                 3.1.1
PySocks                   1.7.1
pytest                    7.4.3
python-dateutil           2.8.2
python-json-logger        2.0.7
python-lsp-jsonrpc        1.1.2
python-lsp-server         1.9.0
python-multipart          0.0.6
pytoolconfig              1.2.6
pytz                      2023.3.post1
PyWavelets                1.4.1
PyYAML                    6.0.1
pyzmq                     25.1.2
ray                       2.9.0
referencing               0.32.0
regex                     2023.10.3
requests                  2.31.0
requests-file             1.5.1
requests-oauthlib         1.3.1
requests-toolbelt         1.0.0
responses                 0.18.0
retry                     0.9.2
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.7.0
rope                      1.11.0
rouge-chinese             1.0.3
rouge-score               0.1.2
rpds-py                   0.14.1
rsa                       4.9
ruamel.yaml               0.16.10
ruamel.yaml.clib          0.2.6
ruff                      0.1.11
ruff-lsp                  0.0.54
s3transfer                0.9.0
safetensors               0.4.1
scikit-learn              1.3.2
scipy                     1.10.1
semantic-version          2.10.0
Send2Trash                1.8.2
sentencepiece             0.1.97
setuptools                68.2.2
shellingham               1.5.4
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
snowballstemmer           2.2.0
soupsieve                 2.5
sqlparse                  0.4.4
stack-data                0.6.3
starlette                 0.32.0.post1
stringcase                1.2.0
StringGenerator           0.4.4
sympy                     1.12
tabulate                  0.8.2
tensorboard               2.11.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorboardX              2.6
termcolor                 2.4.0
terminado                 0.18.0
testpath                  0.6.0
threadpoolctl             3.2.0
timm                      1.0.9
tinycss2                  1.2.1
titans                    0.0.7
tldextract                5.1.1
tokenizers                0.19.1
tomli                     2.0.1
tomlkit                   0.12.0
toolz                     0.12.0
torch                     2.1.0+cu121
torchaudio                2.1.0+cu121
torchpippy                0.1.1+cecc4fc
torchvision               0.16.0+cu121
tornado                   6.4
tqdm                      4.65.0
traitlets                 5.14.1
transformers              4.40.0
triton                    2.1.0
typer                     0.9.0
types-python-dateutil     2.8.19.20240106
typing_extensions         4.12.2
tzdata                    2023.3
ujson                     5.9.0
uncertainty-calibration   0.1.4
Unidecode                 1.3.7
unifile-sdk               0.1.14
uri-template              1.3.0
urllib3                   1.26.18
uvicorn                   0.25.0
virtualenv                20.25.0
watchdog                  2.3.1
wcwidth                   0.2.12
web.py                    0.62
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
websockets                11.0.3
Werkzeug                  3.0.1
wfbuilder                 1.0.56.43
wget                      3.2
whatthepatch              1.0.5
wheel                     0.41.2
wrapt                     1.16.0
xattr                     1.0.0
xxhash                    3.4.1
yacs                      0.1.8
yapf                      0.40.2
yarl                      1.9.4
zdfs-dfs                  2.3.2
zeep                      4.2.1
zipp                      3.17.0
zstandard                 0.19.0

Error traceback

Discovered apex.normalization.FusedRMSNorm - will use it instead of InternLM2RMSNorm
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[2025-01-12 17:53:14,141] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 36605) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
internvl/train/internvl_chat_finetune.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-12_17:53:14
  host      : gpulingjun033184121099.sa127
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 36605)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 36605
=====================================================
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 57 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
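
The launcher output above only says that rank 0 exited with code -8, that is, signal 8 (SIGFPE), and gives no Python stack for the worker. One way to get more detail, offered as a debugging sketch rather than a known fix, is to turn on Python's built-in faulthandler through the PYTHONFAULTHANDLER environment variable, so the worker dumps its Python traceback when a fatal signal arrives. The `signal_name` helper is a hypothetical name used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): map a torchrun-style negative
# exit code such as -8 to the name of the signal that killed the worker.
signal_name() {
    local code=$1
    # Strip the leading minus, then ask the shell for the signal name.
    kill -l "${code#-}"
}

signal_name -8   # prints FPE

# Standard CPython switch: dump the Python traceback on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL). Child processes started
# from this shell, including the torchrun workers, inherit it.
export PYTHONFAULTHANDLER=1
```

Rerunning the torchrun command from the same shell should then print the Python frame that was active when SIGFPE was raised, which narrows down whether the fault is in a native extension or in the data pipeline.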

Jan 12 '25 10:01 lzk9508

Can anyone help with this?

Jan 13 '25 09:01 lzk9508

I ran into this problem too. How did you solve it?

Feb 18 '25 15:02 zhouyang2002

Hi,

How many GPUs did you use for fine-tuning? Your launch script defaults to 8 GPUs (GPUS=${GPUS:-8}), but the GPU count shown is 2 (number of gpus: 2), which points to a problem with the distributed launch command.
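
A quick way to check for this kind of mismatch is to recompute the batch arithmetic before launching. The sketch below is not part of the InternVL scripts: `compute_grad_acc` is a hypothetical helper, and the only thing it assumes is the formula from the script in this issue (BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS). It refuses combinations where integer division would truncate, including the case where the result would be 0.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the InternVL scripts): recompute the
# gradient accumulation steps with the same formula as the launch script,
# and fail when the total batch size is not an exact multiple of
# per-device batch size x number of GPUs.
compute_grad_acc() {
    local batch_size=$1 per_device=$2 gpus=$3
    local per_step=$((per_device * gpus))
    if [ "$per_step" -le 0 ] || [ $((batch_size % per_step)) -ne 0 ]; then
        echo "BATCH_SIZE=$batch_size is not a multiple of $per_device x $gpus" >&2
        return 1
    fi
    echo $((batch_size / per_step))
}

# Script defaults: 64 / 8 / 8
compute_grad_acc 64 8 8   # prints 1

# Values from the comments in the script: 16 / 4 / 2
compute_grad_acc 16 4 2   # prints 2
```

If the script is meant to run on 2 GPUs, the values can be passed explicitly, for example GPUS=2 BATCH_SIZE=16 PER_DEVICE_BATCH_SIZE=4, so that --nproc_per_node matches the hardware actually available.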