[Bug] Fine-tuning following the documentation fails
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Following the official documentation, running `GPUS=4 PER_DEVICE_BATCH_SIZE=4 sh shell/internvl2.0/2nd_finetune/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora.sh` fails.
Documentation: https://internvl.readthedocs.io/en/latest/tutorials/coco_caption_finetune.html
Reproduction
GPUS=4 PER_DEVICE_BATCH_SIZE=4 sh shell/internvl2.0/2nd_finetune/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora.sh
Environment
1. Python: 3.9.19
2. CUDA: cuda_12.4.r12.4 (compiler.33961263_0)
3. Driver Version: 550.54.14
4. transformers: 4.37.2
5. torch: 2.4.0
Error traceback
+ [ ! -d work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora ]
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=4 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path ./pretrained/InternVL2-2B --conv_style internlm2-chat --output_dir work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora --meta_path ./shell/data/internvl_1_2_finetune_custom.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard
+ tee -a work_dirs/internvl_chat_v2_0/internvl2_2b_internlm2_1_8b_dynamic_res_2nd_finetune_lora/training_log.txt
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779]
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] *****************************************
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0805 09:29:53.681633 123911748675072 torch/distributed/run.py:779] *****************************************
[2024-08-05 09:29:57,255] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 09:29:57,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
(the same two FutureWarnings are printed by each of the other three ranks)
Traceback (most recent call last):
  File "/root/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 18, in <module>
    from internvl.dist_utils import init_dist
  File "/root/InternVL/internvl_chat/internvl/dist_utils.py", line 6, in <module>
    import deepspeed
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py)
(the same traceback is raised on each of the other three ranks)
W0805 09:29:58.163211 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3279 closing signal SIGTERM
W0805 09:29:58.163818 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3280 closing signal SIGTERM
W0805 09:29:58.164009 123911748675072 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3281 closing signal SIGTERM
E0805 09:29:58.243624 123911748675072 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3278) of binary: /root/miniconda3/envs/internvl/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/internvl/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-05_09:29:58
host : RTX3090-18700172
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3278)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Attached: python_version.txt (versions of all dependent packages)
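For reference, the failing import can be reproduced outside the training script; this is a minimal sketch run in the same conda environment:

```bash
# Minimal check of the import that deepspeed's elastic_agent performs.
# With torch 2.4.0 this raises the same ImportError as in the log above,
# suggesting `log` is no longer exported from that module in this torch version.
python -c "from torch.distributed.elastic.agent.server.api import log"
```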
Any progress on this? I'm facing the same issue
Any progress on this? I'm running into the same issue as well.
I couldn't solve it, so I switched to Swift.
I'm facing this problem too. Is there any workaround? Can you help me?
Problem solved! You need to install an earlier version of torch, such as 2.1.0, not the latest version. Then you also need to reinstall flash-attn after installing the new torch.
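If it helps, here is a rough sketch of the flash-attn reinstall step after downgrading torch (not official instructions; `--no-build-isolation` is the commonly recommended flash-attn install flag):

```bash
# Rebuild flash-attn against the downgraded torch so its CUDA extensions match.
pip uninstall -y flash-attn
pip install flash-attn --no-build-isolation
```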
Facing the same issue. @Hoantrbl Could you tell me your torch, CUDA, and flash-attn versions? Thanks
torch 2.1.0+cu121, torchaudio 2.1.0+cu121, torchvision 0.16.0+cu121, flash_attn 2.6.3
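For anyone who wants to pin the same versions, a sketch of the install commands (the cu121 index URL is an assumption based on the +cu121 tags above):

```bash
# Install the torch stack reported above from the CUDA 12.1 wheel index,
# then build flash-attn 2.6.3 against it.
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.6.3 --no-build-isolation
```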