[Bug] 2nd_finetune for internvl2.5 using lora ('resource_tracker: There appear to be %d ')
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
The error emerges at this step:
[2025-01-12 17:52:25,277] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:33,530] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:41,724] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:49,987] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-12 17:52:58,392] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 34, images per sample: 4.25, dynamic token length: 2020
Reproduction
set -x

GPUS=${GPUS:-8}
BATCH_SIZE=${BATCH_SIZE:-64}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-8}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS))

export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export MASTER_PORT=34229
export TF_CPP_MIN_LOG_LEVEL=3
export LAUNCHER=pytorch

OUTPUT_DIR='work_dirs/internvl_chat_v2_5/internvl2_5_8b_dynamic_res_2nd_finetune_lora_0112'

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi
# number of gpus: 2
# batch size per gpu: 4
# gradient accumulation steps: 2
# total batch size: 16
# epoch: 1
torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=127.0.0.1 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/245_disk/mb_train_sft_1121/InternVL2_5-8B-MPO" \
  --conv_style "internvl2_5" \
  --use_fast_tokenizer False \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "./shell/data/mb_train_sft_250110.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp False \
  --freeze_backbone True \
  --use_llm_lora 8 \
  --vision_select_layer -1 \
  --dataloader_num_workers 16 \
  --bf16 True \
  --num_train_epochs 3 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-5 \
  --weight_decay 0.05 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 8192 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
Environment
absl-py 2.0.0
accelerate 0.33.0
adabench 1.2.64
aii-pypai 0.1.40.63
aiofiles 23.2.1
aiohttp 3.9.1
aiosignal 1.3.1
aistudio-analyzer 0.0.4.102
aistudio-checkpoint 0.1.241220
aistudio-common 0.0.28.55
aistudio-notebook 2.0.128
aistudio-serving 0.0.0.62
alipay-pcache 0.1.6
aliyun-python-sdk-core 2.14.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
ant-couler 0.0.1rc17
anyio 4.2.0
apex 0.1
archspec 0.2.1
argo-workflows 3.5.1
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astroid 3.0.2
asttokens 2.4.1
async-timeout 4.0.3
atorch 1.2.1
attrs 23.1.0
autopep8 2.0.4
Babel 2.15.0
backcall 0.2.0
beautifulsoup4 4.12.2
bigmodelvis 0.0.1
bitarray 2.8.5
bitsandbytes 0.42.0
bleach 6.1.0
blinker 1.7.0
boltons 23.0.0
boto3 1.34.2
botocore 1.34.2
Brotli 1.0.9
cachetools 3.1.1
cattrs 23.2.3
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 2.0.4
cheroot 10.0.0
click 6.7
click-config-file 0.6.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.1
conda 23.11.0
conda-content-trust 0.2.0
conda-libmamba-solver 23.12.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
configobj 5.0.8
configparser 6.0.0
contourpy 1.1.1
couler-core 0.1.1rc11
crcmod 1.7
cryptography 41.0.7
cycler 0.12.1
Cython 3.0.6
datasets 2.15.0
debugpy 1.8.0
decorator 5.1.1
decord 0.6.0
deepspeed 0.15.4
defusedxml 0.7.1
delta-center-client 0.0.4
Deprecated 1.2.14
deprecation 2.1.0
dill 0.3.7
distlib 0.3.8
distro 1.8.0
dlrover 0.3.6
docker 4.1.0
docstring-to-markdown 0.13
easydl-sdk 0.0.6
einops 0.7.0
entrypoints 0.4
evaluate 0.4.0
exceptiongroup 1.2.0
executing 2.0.1
fairscale 0.4.1
fastapi 0.108.0
fastjsonschema 2.19.1
fastmoe 1.0.0
fasttext 0.9.2
fe 0.3.33
ffmpy 0.3.1
filelock 3.13.1
flake8 6.1.0
flash-attn 2.0.4
flash-attn-1 0.2.6.post2
Flask 3.0.0
fonttools 4.46.0
fqdn 1.5.1
frozenlist 1.4.1
fsspec 2023.10.0
ftfy 6.1.3
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.25.2
google-auth-oauthlib 0.4.6
gradio 4.13.0
gradio_client 0.8.0
grpcio 1.34.1
grpcio-tools 1.34.1
h11 0.14.0
hjson 3.1.0
httpcore 1.0.2
httpx 0.26.0
huggingface-hub 0.27.0
icetk 0.0.7
idna 3.4
imageio 2.35.1
importlib-metadata 7.0.0
importlib-resources 6.1.1
iniconfig 2.0.0
ipykernel 6.29.5
ipython 8.12.3
ipython-genutils 0.2.0
isodate 0.6.1
isoduration 20.11.0
isort 5.13.2
itsdangerous 2.1.2
jaraco.functools 4.0.0
jedi 0.19.1
jedi-language-server 0.41.2
Jinja2 2.11.3
jinjasql 0.1.8
jmespath 0.10.0
joblib 1.3.2
json5 0.9.25
jsonpatch 1.32
jsonpath-ng 1.6.0
jsonpointer 2.1
jsonschema 4.20.0
jsonschema-specifications 2023.11.2
jupyter_client 8.6.0
jupyter_core 5.7.1
jupyter-events 0.9.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.1
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
kiwisolver 1.4.5
kmitool 0.0.9
kubemaker 0.2.17
kubernetes 9.0.0
langdetect 1.0.9
libmambapy 1.5.3
libro 0.1.11
loralib 0.1.1
lsh 0.1.2
lsprotocol 2023.0.0
lxml 4.9.3
M2Crypto 0.38.0
Markdown 3.5.1
markdown-it-py 3.0.0
MarkupSafe 1.1.1
marshmallow 3.20.1
matplotlib 3.7.4
matplotlib-inline 0.1.6
mccabe 0.7.0
mdurl 0.1.2
megatron.core 0.1
menuinst 2.0.1
mistune 0.8.4
mock 5.1.0
more-itertools 10.1.0
mpi4py 3.1.5
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multiprocess 0.70.15
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 3.0
ninja 1.11.1.1
nltk 3.8.1
notebook 6.4.6
numpy 1.23.5
nvidia-cublas-cu12 12.4.5.8
nvidia-ml-py 12.560.30
oauthlib 3.2.2
odps 3.5.1
opencv-python 4.10.0.84
opendelta 0.3.2
orjson 3.9.10
oss2 2.6.0
osscmd 0.4.5
overrides 7.7.0
packaging 23.1
pandas 1.0.0
pandocfilters 1.5.0
parameterized 0.9.0
parso 0.8.3
pathos 0.3.0
peft 0.3.0
peppercorn 0.6
pexpect 4.9.0
pickleshare 0.7.5
Pillow 9.3.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
pluggy 1.0.0
ply 3.11
pox 0.3.3
ppft 1.7.6.7
prettytable 3.9.0
prometheus-client 0.19.0
prompt-toolkit 3.0.43
protobuf 3.20.0
psutil 5.9.6
PTable 0.9.2
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
py-spy 0.3.14
pyaml 21.10.1
pyarrow 12.0.0
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybind11 2.11.1
pycodestyle 2.11.1
pycosat 0.6.6
pycparser 2.21
pycryptodome 3.19.0
pydantic 2.10.4
pydantic_core 2.27.2
pyDes 2.0.1
pydocstyle 6.3.0
pydub 0.25.1
pyflakes 3.1.0
pygls 1.2.1
Pygments 2.17.2
pyhocon 0.3.60
pyinotify 0.9.6
pylint 3.0.3
pynvml 11.4.1
pyodps 0.11.4.1
Pyomo 6.7.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
pytest 7.4.3
python-dateutil 2.8.2
python-json-logger 2.0.7
python-lsp-jsonrpc 1.1.2
python-lsp-server 1.9.0
python-multipart 0.0.6
pytoolconfig 1.2.6
pytz 2023.3.post1
PyWavelets 1.4.1
PyYAML 6.0.1
pyzmq 25.1.2
ray 2.9.0
referencing 0.32.0
regex 2023.10.3
requests 2.31.0
requests-file 1.5.1
requests-oauthlib 1.3.1
requests-toolbelt 1.0.0
responses 0.18.0
retry 0.9.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.7.0
rope 1.11.0
rouge-chinese 1.0.3
rouge-score 0.1.2
rpds-py 0.14.1
rsa 4.9
ruamel.yaml 0.16.10
ruamel.yaml.clib 0.2.6
ruff 0.1.11
ruff-lsp 0.0.54
s3transfer 0.9.0
safetensors 0.4.1
scikit-learn 1.3.2
scipy 1.10.1
semantic-version 2.10.0
Send2Trash 1.8.2
sentencepiece 0.1.97
setuptools 68.2.2
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
snowballstemmer 2.2.0
soupsieve 2.5
sqlparse 0.4.4
stack-data 0.6.3
starlette 0.32.0.post1
stringcase 1.2.0
StringGenerator 0.4.4
sympy 1.12
tabulate 0.8.2
tensorboard 2.11.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6
termcolor 2.4.0
terminado 0.18.0
testpath 0.6.0
threadpoolctl 3.2.0
timm 1.0.9
tinycss2 1.2.1
titans 0.0.7
tldextract 5.1.1
tokenizers 0.19.1
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.0
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchpippy 0.1.1+cecc4fc
torchvision 0.16.0+cu121
tornado 6.4
tqdm 4.65.0
traitlets 5.14.1
transformers 4.40.0
triton 2.1.0
typer 0.9.0
types-python-dateutil 2.8.19.20240106
typing_extensions 4.12.2
tzdata 2023.3
ujson 5.9.0
uncertainty-calibration 0.1.4
Unidecode 1.3.7
unifile-sdk 0.1.14
uri-template 1.3.0
urllib3 1.26.18
uvicorn 0.25.0
virtualenv 20.25.0
watchdog 2.3.1
wcwidth 0.2.12
web.py 0.62
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
websockets 11.0.3
Werkzeug 3.0.1
wfbuilder 1.0.56.43
wget 3.2
whatthepatch 1.0.5
wheel 0.41.2
wrapt 1.16.0
xattr 1.0.0
xxhash 3.4.1
yacs 0.1.8
yapf 0.40.2
yarl 1.9.4
zdfs-dfs 2.3.2
zeep 4.2.1
zipp 3.17.0
zstandard 0.19.0
Error traceback
Discovered apex.normalization.FusedRMSNorm - will use it instead of InternLM2RMSNorm
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[2025-01-12 17:53:14,141] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 36605) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
internvl/train/internvl_chat_finetune.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-12_17:53:14
host : gpulingjun033184121099.sa127
rank : 0 (local_rank: 0)
exitcode : -8 (pid: 36605)
error_file: <N/A>
traceback : Signal 8 (SIGFPE) received by PID 36605
=====================================================
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 57 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
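The exitcode of -8 in the report above follows the common convention that a negative exit code means the worker was terminated by the signal with that number. The resource_tracker warning is printed at shutdown and is likely a side effect of the crash rather than its cause. Below is a minimal sketch of decoding such an exit code with Python's standard signal module; describe_exit_code is a hypothetical helper name used only for illustration, not part of torchrun or InternVL.

```python
import signal


def describe_exit_code(exitcode):
    """Turn a process exit code into a short human-readable description.

    Assumption: negative values mean "terminated by signal -exitcode",
    which is how the torchrun report above presents exitcode -8.
    """
    if exitcode < 0:
        signum = -exitcode
        try:
            # signal.Signals is the standard enum of signals known on this platform.
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # The number does not map to a named signal on this platform.
            return "killed by signal %d" % signum
    return "exited with status %d" % exitcode


print(describe_exit_code(-8))
print(describe_exit_code(0))
```

SIGFPE is raised for fatal arithmetic errors such as integer division by zero in native code, so the Python traceback from torchrun only shows the launcher noticing the dead worker, not the place where the fault happened.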
Can anyone help?
I ran into this problem too. How did you solve it?
Hi,
How many GPUs did you use for fine-tuning? Your launch script defaults to 8 GPUs (GPUS=${GPUS:-8}), but the comment in the script says 2 (number of gpus: 2), which looks like a problem with the distributed launch command.
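To make the mismatch concrete, here is a small sketch of the arithmetic behind GRADIENT_ACC in the reproduction script, written in Python. The function name is hypothetical, and the divisibility check is an added safeguard that the original script does not perform; it is only one way to catch inconsistent settings before launching.

```python
def gradient_accumulation_steps(total_batch_size, per_device_batch_size, num_gpus):
    """Reproduce BATCH_SIZE / PER_DEVICE_BATCH_SIZE / GPUS with sanity checks.

    The shell script uses plain integer division, which silently truncates
    (and can even yield 0). This sketch raises instead.
    """
    samples_per_step = per_device_batch_size * num_gpus
    if samples_per_step <= 0:
        raise ValueError("per-device batch size and GPU count must be positive")
    if total_batch_size % samples_per_step != 0:
        raise ValueError(
            "total batch size %d is not a multiple of %d x %d"
            % (total_batch_size, per_device_batch_size, num_gpus)
        )
    return total_batch_size // samples_per_step


# Script defaults: BATCH_SIZE=64, PER_DEVICE_BATCH_SIZE=8, GPUS=8
print(gradient_accumulation_steps(64, 8, 8))  # 1

# Values stated in the script comments: total 16, 4 per GPU, 2 GPUs
print(gradient_accumulation_steps(16, 4, 2))  # 2
```

With the defaults, torchrun is asked to start 8 processes, so if the machine really has only 2 GPUs, GPUS, BATCH_SIZE and PER_DEVICE_BATCH_SIZE would need to be set explicitly when launching.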
Have you solved this yet?
Same question. How did you solve it?
In my case, this message appeared because the host machine ran out of memory. To solve it, I increased the amount of memory on the machine running the fine-tuning.
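If host memory is the suspect, one way to check it before launching is to read MemAvailable from /proc/meminfo. The sketch below assumes a Linux host; the helper names and the sample text are made up for illustration and do not come from the original report. Note that each training process starts its own data-loading workers (--dataloader_num_workers 16 in the script above), which may add noticeably to host memory use.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_gib(text):
    """Return MemAvailable converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 * 1024)


# Made-up sample values, only to show the expected input format.
SAMPLE = "MemTotal:       65536000 kB\nMemAvailable:    2048000 kB\n"
print("sample: %.2f GiB available" % available_gib(SAMPLE))

# On a Linux host, read the real values (skipped elsewhere).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        real = parse_meminfo(f.read())
    if "MemAvailable" in real:
        print("host: %.2f GiB available" % (real["MemAvailable"] / (1024 * 1024)))
```

Watching this value while the job starts up can help tell an out-of-memory condition on the host apart from a GPU memory problem.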