wenet [Audio LLM] support audiollm for asr, based on whisper and llama3

Conducted an experiment on the LibriSpeech dataset for several steps:

May 20 '24 04:05 Zth9730

Can you provide config.yaml or experimental results?

May 23 '24 07:05 thsxbw

Can you provide config.yaml or experimental results?

Yes, there will be new commits later.

May 23 '24 08:05 Zth9730

Does this branch support Qwen?

Aug 02 '24 10:08 fclearner

audiollm.yaml

The hyperparameters, such as learning rate and warmup, may not be the best.
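To make the learning-rate and warmup settings easier to reason about, here is a minimal sketch of a Noam-style warmup schedule evaluated at the `lr: 4.0e-05` and `warmup_steps: 1000` values from this config. It assumes `scheduler: warmuplr` follows the usual form (linear ramp up to the base rate, then inverse-square-root decay). The function name `warmup_lr` is made up for illustration, so check the actual scheduler code in the branch before relying on these numbers.

```python
def warmup_lr(step: int, base_lr: float = 4.0e-05, warmup_steps: int = 1000) -> float:
    """Learning rate at a given 1-based optimizer step.

    ASSUMPTION: Noam-style warmup. The rate grows linearly until
    `warmup_steps`, where it equals `base_lr`, then decays as 1/sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * scale


if __name__ == "__main__":
    # Print the rate at a few steps to see the ramp and the decay.
    for step in (1, 500, 1000, 4000):
        print(step, f"{warmup_lr(step):.2e}")
```

Under that assumption the rate peaks at 4.0e-05 at step 1000 and is back to half of that by step 4000, so changing `warmup_steps` moves both where the peak falls and how quickly the rate decays afterwards.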

accum_grad: 1
cmvn: null
cmvn_conf:
  cmvn_file: null
  is_json_cmvn: null
dataset: audio_llm
dataset_conf:
  batch_conf:
    batch_type: static
    batch_size: 4
  cycle: 1
  data_style: audiosft
  data_style_conf:
    add_bos: true
    add_eos: true
    template: audio_llama3
  feats_type: log_mel_spectrogram
  filter_audio_conf:
    max_length: 3000
    min_length: 0
  filter_conf:
    token_max_length: 8192
    token_min_length: 1
  log_mel_spectrogram_conf:
    hop_length: 160
    n_fft: 400
    num_mel_bins: 128
    pad_or_trim: true
    padding: 0
  resample_conf:
    resample_rate: 16000
  shift: true
  shuffle: true
  shuffle_conf:
    shuffle_size: 1500
  shuffle_list: true
  shuffle_list_conf:
    shuffle_size: 15000
  sort: true
  sort_conf:
    sort_size: 500
  spec_aug: true
  spec_aug_conf:
    max_f: 10
    max_t: 50
    num_f_mask: 0
    num_t_mask: 0
  spec_sub: false
  spec_sub_conf:
    max_t: 30
    num_t_sub: 3
  spec_trim: false
  speed_perturb: false
decoder: decoder_only
decoder_conf:
  activation_type: swish
  attention_dropout_rate: 0.0
  attention_heads: 32
  dropout_rate: 0.0
  gelu_approximate: null
  gradient_checkpointing: true
  head_dim: 128
  hidden_size: 4096
  linear_units: 14336
  max_position_embeding: 8192
  n_kv_head: 8
  norm_eps: 1.0e-05
  normalize_before: true
  num_blocks: 32
  positional_dropout_rate: 0.0
  rms_norm_offset: false
  rope_style: llama
  rope_theta: 500000.0
  scale_embed: false
  use_sdpa: true
encoder: transformer
encoder_conf:
  activation_type: gelu
  attention_dropout_rate: 0.0
  attention_heads: 20
  dropout_rate: 0.1
  gradient_checkpointing: true
  input_layer: conv1d2
  key_bias: false
  linear_units: 5120
  normalize_before: true
  num_blocks: 32
  output_size: 1280
  pos_enc_layer_type: abs_pos_whisper
  positional_dropout_rate: 0.1
  static_chunk_size: -1
  use_dynamic_chunk: false
  use_dynamic_left_chunk: false
  use_sdpa: true
grad_clip: 1
input_dim: 128
log_interval: 40
max_epoch: 3
save_limited: 1
save_best_ckpt: True
model: audio_llm
model_conf:
  bottleneck_mid_dim: 512
  bottleneck_type: conv-linear
  conv_kernel_sizes:
  - 3
  - 3
  - 3
  length_normalized_loss: false
  linear_bias: false
  lsm_weight: 0.1
  tie_word_embedding: false
  freeze_decoder: true
  freeze_encoder: true
  freeze_llm_embed: false
optim: adamw
optim_conf:
  lr: 4.0e-05
  weight_decay: 0.01
output_dim: 128256
save_interval: 2000
save_states: model_only
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 1000
tokenizer: huggingface
tokenizer_conf:
  model: meta-llama/Meta-Llama-3-8B
  special_tokens:
    <|begin_of_text|>: 128000
    <|end_header_id|>: 128007
    <|end_of_text|>: 128001
    <|eot_id|>: 128009
    <|start_header_id|>: 128006
vocab_size: 128256
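A quick way to catch copy-paste mistakes when adapting this config to another LLM is to check that the size fields agree with each other. The sketch below copies a few values from the config above into plain dicts and tests relations that hold for a Llama 3 8B-style decoder and a Whisper-style encoder. The helper `check_config` is hypothetical and not part of wenet, and the relations are general transformer conventions, not checks that wenet itself is known to enforce.

```python
# Values copied by hand from the config above.
decoder_conf = {"attention_heads": 32, "head_dim": 128,
                "hidden_size": 4096, "n_kv_head": 8}
encoder_conf = {"attention_heads": 20, "output_size": 1280}
top_level = {"input_dim": 128, "num_mel_bins": 128,
             "output_dim": 128256, "vocab_size": 128256}


def check_config(dec: dict, enc: dict, top: dict) -> list:
    """Return a list of human-readable problems (empty if all checks pass).

    Hypothetical helper for illustration only.
    """
    problems = []
    # Llama-style decoders use hidden_size == attention_heads * head_dim.
    if dec["attention_heads"] * dec["head_dim"] != dec["hidden_size"]:
        problems.append("decoder: attention_heads * head_dim != hidden_size")
    # Grouped-query attention needs query heads to split evenly over KV heads.
    if dec["attention_heads"] % dec["n_kv_head"] != 0:
        problems.append("decoder: attention_heads not divisible by n_kv_head")
    # The encoder width must split evenly across its attention heads.
    if enc["output_size"] % enc["attention_heads"] != 0:
        problems.append("encoder: output_size not divisible by attention_heads")
    # The feature dimension fed to the encoder should match the mel bins.
    if top["input_dim"] != top["num_mel_bins"]:
        problems.append("input_dim != num_mel_bins")
    # The output projection should cover the whole tokenizer vocabulary.
    if top["output_dim"] != top["vocab_size"]:
        problems.append("output_dim != vocab_size")
    return problems


if __name__ == "__main__":
    print(check_config(decoder_conf, encoder_conf, top_level))
```

With the values as posted, every check passes: 32 query heads of size 128 give the 4096 hidden size, and they are shared by 8 KV heads.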

decode scripts:

temperature=1.0
top_p=1.0
top_k=1
for test in $recog_set; do
    result_dir=$dir/${test}
    python wenet/bin/audiollm_recognize.py --gpu 0 \
      --config $dir/train.yaml \
      --data_type raw \
      --dtype bf16 \
      --test_data $wave_data/$test/data.list \
      --checkpoint $decode_checkpoint \
      --output_len 256 \
      --temperature $temperature \
      --top_p $top_p \
      --top_k $top_k \
      --result_dir $result_dir
    test_dir=$result_dir/temp${temperature}_topk${top_k}_topp${top_p}
    python tools/compute-wer.py --char=1 --v=1 \
      $wave_data/$test/text $test_dir/text > $test_dir/wer
done
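A note on the sampling settings in the script: with `top_k=1`, the temperature and `top_p` values stop mattering and decoding is effectively greedy. The sketch below is a generic temperature / top-k / top-p sampler in plain Python that shows why. It is not the code in `audiollm_recognize.py`, whose internals are not shown in this thread, and the function name `sample_next_token` is made up for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=1, top_p=1.0, rng=random):
    """Pick one token id from `logits` (generic sketch, not wenet's code)."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable, then keep the top_k best.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus filter: keep the shortest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for token_id in order:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    # Sample among the survivors; with one survivor this is just argmax.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 2.5, -1.0, 0.3]
    print(sample_next_token(toy_logits, temperature=1.0, top_k=1, top_p=1.0))
```

Because only the single most probable token survives the top-k step, repeated runs with these settings return the same id, which is what you usually want when reporting WER.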

Aug 07 '24 02:08 Zth9730

Does this branch support Qwen?

You can refer to Zhou's code to convert the Qwen weights to the wenet format, and then it will be supported 🐶

Aug 07 '24 02:08 Zth9730

Please rebase onto main; part of the LLM code has already been merged into main.

Aug 07 '24 06:08 Mddct
