Best practice for Qwen2-Audio
Environment Preparation
# Install ms-swift
pip install git+https://github.com/modelscope/swift.git#egg=ms-swift[llm]
# Install the latest transformers
pip install git+https://github.com/huggingface/transformers.git
pip install librosa
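Optionally, a quick sanity check (my own addition, not part of the original instructions) that the environment is in place before running inference:

import transformers
import librosa
import swift  # verifies ms-swift is importable

# Qwen2-Audio needs a recent transformers build, hence the from-source install above.
print(f'transformers: {transformers.__version__}')
print(f'librosa: {librosa.__version__}')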
Inference
Instruct model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b-instruct
# If using a local path
CUDA_VISIBLE_DEVICES=0 swift infer \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path '<local_path>'
Inference result:
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav
Yes, I can guess that you are a female in your twenties.
--------------------------------------------------
<<< <audio>
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav
每个人都希望被欣赏,所以如果你欣赏某人,不要把它保密。 (Everyone wants to be appreciated, so if you appreciate someone, don't keep it a secret.)
--------------------------------------------------
<<< clear
<<< 你是谁 (Who are you?)
我是来自达摩院的语言模型,我叫通义千问。 (I am a language model from DAMO Academy; my name is Tongyi Qianwen.)
Using Python:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = None
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = '<audio>这段语音说了什么'
audios = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav']
response, history = inference(model, template, query, audios=audios)
print(f'query: {query}')
print(f'response: {response}')
# streaming
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history, audios=audios)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <audio>这段语音说了什么
response: 这段语音说的是:'今天天气真好呀'
query: 这段语音是男生还是女生
response: 男声。
history: [['<audio>这段语音说了什么', "这段语音说的是:'今天天气真好呀'"], ['这段语音是男生还是女生', '男声。']]
"""
Memory usage:
Base Model:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b
Inference result:
<<< <audio>Generate the caption in English:
Input an audio path or URL <<< https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3
Glass is breaking.
Fine-tuning
Typically, fine-tuning a multimodal large model uses a custom dataset. Here we show a demo that can be run directly: we fine-tune on the aishell1-zh-mini dataset, which you can find on ModelScope at https://modelscope.cn/datasets/speech_asr/speech_asr_aishell1_trainsets
Using Python:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import sft_main, SftArguments, ModelType, DatasetName
sft_main(SftArguments(model_type=ModelType.qwen2_audio_7b_instruct,
                      model_id_or_path=None,
                      dataset=[DatasetName.aishell1_zh_mini]))
ZeRO2:
# If using a local path, add: `--model_id_or_path <local_path>`
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen2-audio-7b-instruct \
--dataset aishell1-zh-mini \
--deepspeed default-zero2
If you want to use a custom dataset, simply specify it as follows:
# val_dataset is optional; if not specified, a portion of dataset will be split off as the validation set
--dataset train.jsonl \
--val_dataset val.jsonl \
Custom datasets support JSON and JSONL formats. Two custom dataset formats are shown below:
[
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio>11111"},
        {"from": "assistant", "value": "22222"}
    ]},
    {"conversations": [
        {"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
        {"from": "assistant", "value": "bbbbb"},
        {"from": "user", "value": "<audio>audio_path</audio>ccccc"},
        {"from": "assistant", "value": "ddddd"}
    ]},
    {"conversations": [
        {"from": "user", "value": "AAAAA"},
        {"from": "assistant", "value": "BBBBB"},
        {"from": "user", "value": "CCCCC"},
        {"from": "assistant", "value": "DDDDD"}
    ]}
]
{"query": "<audio>55555", "response": "66666", "audios": ["audio_path"]}
{"query": "<audio><audio>eeeee", "response": "fffff", "history": [], "audios": ["audio_path1", "audio_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
Memory usage:
Inference script after fine-tuning:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true
# merge-lora and inference
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen2-audio-7b-instruct/vx-xxx/checkpoint-xxx \
--load_dataset_config true --merge_lora true
An example of the fine-tuned model running inference on the validation set (due to time constraints, training ran for only 400 steps):
The training log does not report an acc value. Is this a problem with my settings?
export WANDB_API_KEY=""
swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path "" \
--sft_type full \
--freeze_parameters 0.999 \
--template_type AUTO \
--dtype AUTO \
--output_dir output \
--custom_train_dataset_path "" \
--val_dataset '' \
--val_dataset_sample -1 \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 2048 \
--check_dataset_strategy warning \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 32 \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--lazy_tokenize true \
--evaluation_strategy 'no' \
--system '' \
--save_strategy "steps" \
--report_to 'wandb' \
--acc_strategy 'token' \
--acc_steps 10
Hi @Jintao-Huang,
I'd be interested in further fine-tuning it to improve its German-language performance. Are there any plans to include this architecture in mergekit? My thoughts were to either:
- merge e.g. VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct into it (assuming that this merge won't damage the audio layers), or
- fine-tune it on a German dataset (most likely synthetic).
Any hints on how to proceed?
Best, Julian
Which lora_target_modules can be chosen when fine-tuning Qwen2-Audio?
I checked, and peft_config.target_modules is empty.
https://github.com/modelscope/ms-swift/issues/1747
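For reference, a minimal sketch (my own addition, not from the linked issue) that enumerates the nn.Linear module names in the model, the usual candidates for lora_target_modules; it assumes a transformers version that ships Qwen2-Audio and torch>=2.0 for the meta-device trick:

import torch
import torch.nn as nn
from transformers import AutoConfig, Qwen2AudioForConditionalGeneration

config = AutoConfig.from_pretrained('Qwen/Qwen2-Audio-7B-Instruct')
with torch.device('meta'):  # build the module tree without allocating real weights
    model = Qwen2AudioForConditionalGeneration(config)
linear_names = sorted({name.split('.')[-1] for name, module in model.named_modules()
                       if isinstance(module, nn.Linear)})
print(linear_names)  # candidate values for lora_target_modules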
A question: when I run LoRA SFT, after a few steps the loss becomes 0 and grad_norm becomes nan, and it stays that way from then on. I've tried different LoRA parameters and batch sizes, but it always ends up at 0 and nan; only the number of steps before it happens differs. Could anyone offer suggestions on where the problem might be?
{'loss': 2.09631252, 'grad_norm': 7.01568747, 'learning_rate': 3.4e-07, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.027149, 'epoch': 0.0, 'global_step/max_steps': '1/5828', 'percentage': '0.02%', 'elapsed_time': '34s', 'remaining_time': '2d 7h 26m 48s'}
{'loss': 1.99507056, 'grad_norm': 6.31390953, 'learning_rate': 3.42e-06, 'memory(GiB)': 63.49, 'train_speed(iter/s)': 0.029137, 'epoch': 0.0, 'global_step/max_steps': '10/5828', 'percentage': '0.17%', 'elapsed_time': '5m 40s', 'remaining_time': '2d 7h 2m 59s'}
{'loss': 1.66510525, 'grad_norm': 4.81519556, 'learning_rate': 6.85e-06, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029251, 'epoch': 0.0, 'global_step/max_steps': '20/5828', 'percentage': '0.34%', 'elapsed_time': '11m 21s', 'remaining_time': '2d 6h 56m 50s'}
{'loss': 1.06762638, 'grad_norm': 3.60125303, 'learning_rate': 1.027e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029195, 'epoch': 0.01, 'global_step/max_steps': '30/5828', 'percentage': '0.51%', 'elapsed_time': '17m 4s', 'remaining_time': '2d 7h 1m 36s'}
{'loss': 0.48049116, 'grad_norm': 1.70112872, 'learning_rate': 1.37e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029208, 'epoch': 0.01, 'global_step/max_steps': '40/5828', 'percentage': '0.69%', 'elapsed_time': '22m 46s', 'remaining_time': '2d 6h 56m 30s'}
{'loss': 1.17152777, 'grad_norm': nan, 'learning_rate': 1.712e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029222, 'epoch': 0.01, 'global_step/max_steps': '50/5828', 'percentage': '0.86%', 'elapsed_time': '28m 28s', 'remaining_time': '2d 6h 50m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.055e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029447, 'epoch': 0.01, 'global_step/max_steps': '60/5828', 'percentage': '1.03%', 'elapsed_time': '33m 54s', 'remaining_time': '2d 6h 20m 28s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.397e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.0296, 'epoch': 0.01, 'global_step/max_steps': '70/5828', 'percentage': '1.20%', 'elapsed_time': '39m 22s', 'remaining_time': '2d 5h 58m 35s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.74e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029741, 'epoch': 0.01, 'global_step/max_steps': '80/5828', 'percentage': '1.37%', 'elapsed_time': '44m 47s', 'remaining_time': '2d 5h 38m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.082e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029822, 'epoch': 0.02, 'global_step/max_steps': '90/5828', 'percentage': '1.54%', 'elapsed_time': '50m 15s', 'remaining_time': '2d 5h 24m 5s'}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.425e-05, 'memory(GiB)': 63.5, 'train_speed(iter/s)': 0.029877, 'epoch': 0.02, 'global_step/max_steps': '100/5828', 'percentage': '1.72%', 'elapsed_time': '55m 44s', 'remaining_time': '2d 5h 12m 53s'}
Data format
{"conversations": [{"from": "user", "value": "<audio>xxxx.wav</audio>textabcd"}, {"from": "assistant", "value": "texthijk"}]}
Command-line arguments
OMP_NUM_THREADS=4 NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=3,4 swift sft \
--model_type qwen2-audio-7b-instruct \
--model_id_or_path ./Qwen2-Audio-7B-Instruct \
--tuner_backend peft \
--dataset ./total_audios_prompt_qwen2.jsonl \
--dataset_test_ratio 0.01 \
--dataloader_num_workers 1 \
--report_to "none" \
--max_length 1024 \
--save_steps 100 \
--eval_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--gradient_accumulation_steps 5 \
--output_dir output \
--save_total_limit 50 \
--lazy_tokenize true \
--preprocess_num_proc 1 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--sft_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--use_flash_attn false \
--dtype bf16 \
--warmup_ratio 0.05 \
--num_train_epochs 1
How can I achieve batch inference with the swift framework? Is there a parameter like --batch-size to accelerate the swift infer script?
How can vllm or lmdeploy be used for acceleration?
When fine-tuning Qwen2-Audio, can LoRA be used to train only the audio-encoder part? How should this be configured? @Jintao-Huang
I ran into this loss-0/grad_norm-nan problem too: fine-tuning went smoothly on one dataset, but on another, NaN reliably appeared after several steps. After much debugging I found that corrupted data was being read. I suggest adding logging to trainer.py in the transformers package, and once NaN appears, carefully re-checking the data of the current step and the previous one.
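To make that easier, here is a minimal sketch (my own debugging aid, not something shipped by ms-swift) of a transformers TrainerCallback that stops training as soon as the logged loss becomes 0 or nan, so the offending batch is still the most recent one:

import math
from transformers import TrainerCallback

class NanLossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get('loss')
        if loss is not None and (math.isnan(loss) or loss == 0.0):
            print(f'Abnormal loss {loss} at global step {state.global_step}; stopping.')
            control.should_training_stop = True

In a custom training script you would attach it with trainer.add_callback(NanLossCallback()).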
Using the Python API, the inference result is incorrect. Why does the output not match the example?
query: 这段语音说了什么
response: <|endoftext|>ORIZONTAL_RULE' is not defined
I ran into an issue: "You are attempting to use Flash Attention 2.0 without specifying a torch dtype." The run fails with this error. Which version of flash-attn does qwen2-audio need?
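Not a confirmed answer from this thread, but that warning usually means the model was loaded without an explicit half-precision dtype. A minimal sketch (my own; it assumes a flash-attn 2.x build compatible with your transformers version) of loading with an explicit dtype:

import torch
from transformers import Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    'Qwen/Qwen2-Audio-7B-Instruct',
    torch_dtype=torch.bfloat16,               # FA2 only supports fp16/bf16
    attn_implementation='flash_attention_2',  # requires the flash-attn package
)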
On the nan issue: training with fp32 does not produce nan. After investigation: the __call__ function of the Whisper feature extractor takes a do_normalize argument that defaults to False, and in processing_qwen2_audio.py the __call__ function calls audio_inputs = self.feature_extractor(audios, sampling_rate=sampling_rate, return_attention_mask=True, padding="max_length", **kwargs) without passing do_normalize, so the raw audio is never normalized. When the raw audio values are too large or too small, they can exceed the range or precision of bf16/fp16, turning the loss into nan. I will open a PR adding this optional do_normalize argument to the processor. (Update: this turned out not to be the cause; please ignore.)
mark
Which part does qwen2-audio fine-tuning train, the language_model part or the whole model? @Jintao-Huang
LoRA defaults to the language_model; you can set --target_modules ALL to train all linear layers. Full-parameter training defaults to all parameters; the encoder part can be frozen with --freeze_vit.
Hello, I fine-tuned Qwen2-Audio on my own dataset with your settings, and the outputs are strange; the model cannot fully follow instructions. The same script fine-tunes Qwen1-Audio without any problems. What do you think the cause might be?
Please share your shell script.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]:     sft_main()
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]:     result = llm_x(args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/llm/sft.py", line 545, in llm_sft
[rank0]:     return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/llm/sft.py", line 495, in trainer_train
[rank0]:     trainer.train(training_args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 488, in train
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2140, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 564, in _maybe_log_save_evaluate
[rank0]:     super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3018, in _maybe_log_save_evaluate
[rank0]:     self._save_checkpoint(model, trial)
[rank0]:   File "/usr/local/lib/python3.10/site-packages/swift/trainers/mixin.py", line 386, in _save_checkpoint
[rank0]:     result = super()._save_checkpoint(model, trial, metrics)
[rank0]: TypeError: Trainer._save_checkpoint() takes 3 positional arguments but 4 were given
Hello, running the following command produces the error above. Could you take a look at how to fix it?
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2-audio-7b-instruct \
    --dataset aishell1-zh-mini \
    --deepspeed default-zero2
Sorry, I only just saw the email. The whole model; it is all working now.
Sharing the shell script as requested above:
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen2-audio-7b-instruct \
    --model_id_or_path Model_Files/Qwen2-Audio-7B-Instruct \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype AUTO \
    --train_dataset_sample -1 \
    --max_length 2048 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --save_total_limit 2 \
    --batch_size 4 \
    --use_flash_attn false \
    --lazy_tokenize true \
    --dataset train_v1.jsonl \
    --val_dataset test_v1.jsonl \
    --output_dir /output \
    >> ./sft-v1.log
The dataset has 65,000 samples for emotion description. At step 500 the loss is about 6.x, and at step 3,000 it is still about 6.x, oscillating around 6.5 and unable to converge further. With Qwen1-Audio the same setup converges to 0.8x.
I checked the data and found that some audio clips are too short. During whisper feature extraction the window size is 400 frames; if the audio is shorter than 400 samples, the resulting mel feature length is 0, so what actually enters the computation is a zero tensor, which causes strange problems. Validating the length of all audio and text before training and discarding clips that are too short made training behave normally.
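Following that diagnosis, a minimal pre-filter sketch (my own; the train.jsonl path, the 16 kHz sample rate, and the 400-sample threshold are assumptions based on the comment above):

import json
import librosa

MIN_SAMPLES = 400  # whisper feature-extraction window size mentioned above

with open('train.jsonl', 'r', encoding='utf-8') as fin, \
        open('train_filtered.jsonl', 'w', encoding='utf-8') as fout:
    for line in fin:
        sample = json.loads(line)
        # Keep the sample only if every referenced clip is long enough.
        lengths = [len(librosa.load(path, sr=16000)[0]) for path in sample.get('audios', [])]
        if all(n >= MIN_SAMPLES for n in lengths):
            fout.write(json.dumps(sample, ensure_ascii=False) + '\n')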
Regarding the "TypeError: Trainer._save_checkpoint() takes 3 positional arguments but 4 were given" traceback above: please use transformers<4.46 or upgrade ms-swift to 2.5.2.
Regarding the loss that oscillates around 6.5 and will not converge, check whether this is the cause:
https://github.com/modelscope/ms-swift/issues/2361
OK, thanks for sharing.
Hello, a question: with sft_type set to full and model_type=qwen2-audio-7b-instruct, GPU memory usage on an A100 is only about 10 GB, which does not look like full-parameter SFT of a 7B model. Is some setting wrong? Thanks.
PS: I tried auto_find_batch_size, but the bs in the generated sft_args.json is 1.
For multi-turn dialogue data, is only the final response used to compute the loss during training?
How is the aishell dataset in the example above constructed? Pairing the data downloaded directly from the ModelScope community with the aishell wav files keeps producing errors, and for the data the swift script downloads into the cache folder, I cannot see the concrete JSON files showing how it is constructed. How should JSON files be built for tasks like ASR and speech translation?
When fine-tuning Qwen2-Audio, can LoRA be used to train only the audio-encoder part? How should the command be configured? @HyacinthJingjing

