Qwen-VL
[BUG] <Full-parameter fine-tuning> Training hangs after running finetune_ds.sh
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
After running finetune_ds.sh, training hangs at mixed_x_layer = self.c_attn(hidden_states) in the forward function of the QWenAttention class.

```
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
/root/.cache/torch_extensions/py38_cu121/fused_adam
Parameter Offload: Total persistent parameters: 1815808 in 491 params
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.6533713340759277 seconds
/usr/local/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  0%|
```
The log shows the training progress bar has already appeared; further debugging shows the process is stuck at mixed_x_layer = self.c_attn(hidden_states) in the forward function of the QWenAttention class. That call is just a linear layer, yet execution never returns from nn.Linear.
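Not a fix, but for anyone trying to confirm where the process is stuck: a minimal sketch using the standard-library faulthandler module (the insertion point in finetune.py is an assumption, not part of the repo):

```python
# Hang-diagnosis sketch (standard library only): add near the top of
# finetune.py before training starts. If the process stalls, every
# thread's stack is dumped to stderr, showing the exact blocking line
# (e.g. mixed_x_layer = self.c_attn(hidden_states)).
import faulthandler
import signal
import sys

# Dump all thread stacks every 10 minutes until the process exits.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

# On-demand dump as well: `kill -USR1 <pid>` prints the stacks immediately.
faulthandler.register(signal.SIGUSR1)
```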
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS: Red Hat 7
- Python: 3.8.8
- Transformers: 4.31.0
- PyTorch: 2.1.2+cu121
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
Anything else?
The finetune_ds.sh script:

```bash
#!/bin/bash
# -*- coding: utf-8 -*-
export NCCL_DEBUG=INFO
# export NCCL_P2P_DISABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
DIR=`pwd`
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6011
MODEL="models/Qwen-VL/qwen/Qwen-VL" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="Qwen-VL/assets/train_json/temp.json"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--data_path $DATA \
--bf16 True \
--fix_vit True \
--output_dir output_qwen \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed finetune/ds_config_zero3.json
```
Hi, I've run into the same problem. Did you manage to solve it?
How much GPU memory does full-parameter fine-tuning require?
@micsama @Waxyoung @decreasbetter @hzhwcmhf May I ask what resources full-parameter fine-tuning requires?
I've hit the same problem. Could it be related to the Linux kernel version?
I saw this warning:

```
warnings.warn(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/mnt/cache/huangzhiyuan/env/seeclick/lib/python3.11/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
```
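For anyone comparing environments, the running kernel version can also be read from Python (standard library; just a convenience sketch):

```python
# Print the running kernel release; the warning above is emitted when this
# is below 5.5.0, which it says can cause the process to hang.
import platform
print(platform.release())  # e.g. "4.18.0-..." on the machine above
```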
Same issue here.
In my training experience, when Qwen-VL gets stuck right as training starts, the cause is a problem with the dataset. Check your JSON file, or try taking only a few records out of the JSON file and training on those (see the sketch below).
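For example, something like the following could slice off a few records to test with. This is a sketch assuming the training data is a JSON list of conversation records, as the fine-tuning README describes; the output path is a placeholder:

```python
# Sketch: write the first few records of the training JSON to a new file,
# to check whether the hang is caused by specific data.
import json

src = "Qwen-VL/assets/train_json/temp.json"        # path from the script above
dst = "Qwen-VL/assets/train_json/temp_small.json"  # hypothetical output path

with open(src, "r", encoding="utf-8") as f:
    data = json.load(f)

# The fine-tuning format is expected to be a JSON list of conversation records.
assert isinstance(data, list), "training data should be a JSON list"
print(f"{len(data)} records loaded; keeping the first 8")

with open(dst, "w", encoding="utf-8") as f:
    json.dump(data[:8], f, ensure_ascii=False, indent=2)
```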