wenet
[Audio LLM]support audiollm for asr, based on whisper and llama3
Conducted an experiment on the LibriSpeech dataset for several steps:
Can you provide config.yaml or experimental results?
Yes, there will be new commits later.
Does this branch support Qwen?
audiollm.yaml
The hyperparameters, such as learning rate and warmup, may not be the best.
accum_grad: 1
cmvn: null
cmvn_conf:
  cmvn_file: null
  is_json_cmvn: null
dataset: audio_llm
dataset_conf:
  batch_conf:
    batch_type: static
    batch_size: 4
  cycle: 1
  data_style: audiosft
  data_style_conf:
    add_bos: true
    add_eos: true
    template: audio_llama3
  feats_type: log_mel_spectrogram
  filter_audio_conf:
    max_length: 3000
    min_length: 0
  filter_conf:
    token_max_length: 8192
    token_min_length: 1
  log_mel_spectrogram_conf:
    hop_length: 160
    n_fft: 400
    num_mel_bins: 128
    pad_or_trim: true
    padding: 0
  resample_conf:
    resample_rate: 16000
  shift: true
  shuffle: true
  shuffle_conf:
    shuffle_size: 1500
  shuffle_list: true
  shuffle_list_conf:
    shuffle_size: 15000
  sort: true
  sort_conf:
    sort_size: 500
  spec_aug: true
  spec_aug_conf:
    max_f: 10
    max_t: 50
    num_f_mask: 0
    num_t_mask: 0
  spec_sub: false
  spec_sub_conf:
    max_t: 30
    num_t_sub: 3
  spec_trim: false
  speed_perturb: false
decoder: decoder_only
decoder_conf:
  activation_type: swish
  attention_dropout_rate: 0.0
  attention_heads: 32
  dropout_rate: 0.0
  gelu_approximate: null
  gradient_checkpointing: true
  head_dim: 128
  hidden_size: 4096
  linear_units: 14336
  max_position_embeding: 8192
  n_kv_head: 8
  norm_eps: 1.0e-05
  normalize_before: true
  num_blocks: 32
  positional_dropout_rate: 0.0
  rms_norm_offset: false
  rope_style: llama
  rope_theta: 500000.0
  scale_embed: false
  use_sdpa: true
encoder: transformer
encoder_conf:
  activation_type: gelu
  attention_dropout_rate: 0.0
  attention_heads: 20
  dropout_rate: 0.1
  gradient_checkpointing: true
  input_layer: conv1d2
  key_bias: false
  linear_units: 5120
  normalize_before: true
  num_blocks: 32
  output_size: 1280
  pos_enc_layer_type: abs_pos_whisper
  positional_dropout_rate: 0.1
  static_chunk_size: -1
  use_dynamic_chunk: false
  use_dynamic_left_chunk: false
  use_sdpa: true
grad_clip: 1
input_dim: 128
log_interval: 40
max_epoch: 3
save_limited: 1
save_best_ckpt: True
model: audio_llm
model_conf:
  bottleneck_mid_dim: 512
  bottleneck_type: conv-linear
  conv_kernel_sizes:
  - 3
  - 3
  - 3
  length_normalized_loss: false
  linear_bias: false
  lsm_weight: 0.1
  tie_word_embedding: false
  freeze_decoder: true
  freeze_encoder: true
  freeze_llm_embed: false
optim: adamw
optim_conf:
  lr: 4.0e-05
  weight_decay: 0.01
output_dim: 128256
save_interval: 2000
save_states: model_only
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 1000
tokenizer: huggingface
tokenizer_conf:
  model: meta-llama/Meta-Llama-3-8B
  special_tokens:
    <|begin_of_text|>: 128000
    <|end_header_id|>: 128007
    <|end_of_text|>: 128001
    <|eot_id|>: 128009
    <|start_header_id|>: 128006
vocab_size: 128256
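For anyone adapting this config, a few of the numbers are tied together, so it may help to check them before launching a run. The sketch below is only arithmetic on values copied from the YAML. It assumes that `filter_audio_conf.max_length` is counted in feature frames (as with WeNet's other length filters), and the constant names are made up for illustration.

```python
# Values copied from audiollm.yaml; the constant names are illustrative only.
SAMPLE_RATE = 16000   # dataset_conf.resample_conf.resample_rate
HOP_LENGTH = 160      # dataset_conf.log_mel_spectrogram_conf.hop_length
MAX_FRAMES = 3000     # dataset_conf.filter_audio_conf.max_length (ASSUMED to be frames)

DEC_HEADS = 32        # decoder_conf.attention_heads
DEC_KV_HEADS = 8      # decoder_conf.n_kv_head
DEC_HEAD_DIM = 128    # decoder_conf.head_dim
DEC_HIDDEN = 4096     # decoder_conf.hidden_size

ENC_HEADS = 20        # encoder_conf.attention_heads
ENC_OUTPUT = 1280     # encoder_conf.output_size

# One feature frame is produced every HOP_LENGTH samples.
frames_per_second = SAMPLE_RATE / HOP_LENGTH
max_audio_seconds = MAX_FRAMES / frames_per_second

# Grouped-query attention: several query heads share one key/value head.
queries_per_kv_head = DEC_HEADS // DEC_KV_HEADS
enc_head_dim = ENC_OUTPUT // ENC_HEADS

# Consistency checks that fail loudly if one value is edited without the others.
assert DEC_HEADS * DEC_HEAD_DIM == DEC_HIDDEN
assert DEC_HEADS % DEC_KV_HEADS == 0
assert ENC_OUTPUT % ENC_HEADS == 0

print(f"frames per second:   {frames_per_second}")
print(f"max audio length:    {max_audio_seconds} s")
print(f"query heads per KV:  {queries_per_kv_head}")
print(f"encoder head dim:    {enc_head_dim}")
```

Under that assumption the audio filter keeps utterances up to 30 seconds, which matches the window length Whisper encoders are trained on.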
decode scripts:
temperature=1.0
top_p=1.0
top_k=1
for test in $recog_set; do
  result_dir=$dir/${test}
  python wenet/bin/audiollm_recognize.py --gpu 0 \
    --config $dir/train.yaml \
    --data_type raw \
    --dtype bf16 \
    --test_data $wave_data/$test/data.list \
    --checkpoint $decode_checkpoint \
    --output_len 256 \
    --temperature $temperature \
    --top_p $top_p \
    --top_k $top_k \
    --result_dir $result_dir
  test_dir=$result_dir/temp${temperature}_topk${top_k}_topp${top_p}
  python tools/compute-wer.py --char=1 --v=1 \
    $wave_data/$test/text $test_dir/text > $test_dir/wer
done
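A note on the sampling flags: with `top_k=1` only the most likely token survives filtering, so decoding should be effectively greedy regardless of `temperature` and `top_p`. The sketch below illustrates that with a generic temperature / top-k / top-p sampler in plain Python. It is not the code in `audiollm_recognize.py`; the function `sample_next_token` and its argument handling are assumptions for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Generic sampler sketch (hypothetical helper, not WeNet's implementation)."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Sample among the surviving tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
# Same settings as the decode script: only the argmax token can be drawn.
greedy_draws = {sample_next_token(logits, temperature=1.0, top_k=1, top_p=1.0)
                for _ in range(20)}
print(greedy_draws)
```

If sampled (non-greedy) decoding is wanted, `top_k` would need to be raised above 1 in the script; the WER numbers would then vary from run to run.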
You can follow Zhou's code to convert the Qwen weights into the WeNet format, and then it can be supported 🐶
Please rebase onto main; part of the LLM work has already been merged into main.