
About MAGNeT's performance

Open RevolGMPHL opened this issue 1 year ago • 6 comments

Why is the performance so poor? Is there a bug, or is the model itself simply this weak?

RevolGMPHL avatar Jan 16 '24 12:01 RevolGMPHL

Do you mean sound quality or speed? If sound quality, yes I was expecting better too.

CyberTimon avatar Jan 16 '24 15:01 CyberTimon

  • Try changing the span arrangement from 'nonoverlap' to 'stride1'; this should improve audio quality. It is not the default because it was introduced only recently.
  • The models are sensitive to the generation parameters, so try experimenting with them, especially max_cfg_coef, decoding steps, and temperature.
  • There is some degradation in quality in the released checkpoints (trained on 16K hours of data) compared to the models reported in our paper (trained on 20K hours of data). You can hear samples from the original models at https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/.
  • In general, quality depends heavily on the training datasets, and we encourage the community to experiment with training MAGNeT on new datasets.
  • We plan to release the Hybrid-MAGNeT model and the model rescoring code in the near future (see our paper for explanations of both), which should make the results more stable.
  • Non-autoregressive transformers are still under-researched compared to autoregressive transformer decoders. We encourage the community to devise improvements for both the method and the open-sourced code.

lonzi avatar Jan 16 '24 19:01 lonzi
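
For anyone who wants to experiment with the generation parameters mentioned above in a systematic way, here is a minimal sketch of a small grid sweep. The specific values in `GRID` are hypothetical starting points, not recommendations from the authors, and the key names (`top_p`, `decoding_steps`, and so on) are assumed to match the keyword arguments of audiocraft's `MAGNeT.set_generation_params`; check them against your installed version.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def sweep(base: Dict[str, Any], grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one parameter dict per combination in `grid`, layered over `base`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}


# Settings held fixed across the sweep (span arrangement as suggested in the thread).
BASE = {"span_arrangement": "stride1", "use_sampling": True, "top_p": 0.9}

# Hypothetical values to try; they are illustrative only.
GRID = {
    "temperature": [2.5, 3.0, 3.5],
    "max_cfg_coef": [5.0, 10.0, 20.0],
    "decoding_steps": [[20, 10, 10, 10], [40, 10, 10, 10]],
}

CANDIDATES = list(sweep(BASE, GRID))
```

Each entry of `CANDIDATES` could then be passed to the model (for example as `model.set_generation_params(**candidate)`, assuming that signature) and the outputs compared by ear.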

Thanks @lonzi for the answer! Will try out these tips tomorrow.

CyberTimon avatar Jan 16 '24 20:01 CyberTimon

Thanks for the answer. Although I tried several parameter settings, I still feel the performance is not as good as the original model.

RevolGMPHL avatar Jan 17 '24 02:01 RevolGMPHL

@lonzi, could you kindly share your sampling parameters or add a note to the README with recommended parameters?

Thank you so much!

CyberTimon avatar Jan 17 '24 10:01 CyberTimon

For Music:

  • span_arrangement: 'stride1'
  • use_sampling: true
  • top-p: 0.9
  • temperature: 3.0
  • max_cfg_coef: 10.0
  • min_cfg_coef: 1.0
  • decoding_iterations (for 10 secs): [20, 10, 10, 10]
  • decoding_iterations (for 30 secs): [60, 10, 10, 10]

See our paper for the ablation studies.

For Sound [audio-magnet models]:

  • span_arrangement: 'stride1'
  • use_sampling: true
  • top-p: 0.8
  • temperature: 3.5
  • max_cfg_coef: 20.0
  • min_cfg_coef: 1.0
  • decoding_iterations: [20, 10, 10, 10]

*These are the parameters from the paper's ablation studies; they are not necessarily tuned for the open-source models.

lonzi avatar Jan 17 '24 11:01 lonzi
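
The settings above can be collected into a small helper, sketched below. The numeric values are copied from the post. Everything else is an assumption: the key names follow the keyword arguments that audiocraft's `MAGNeT.set_generation_params` is believed to accept (`top_p` for "top-p", `decoding_steps` for "decoding_iterations"), and `recommended_params` and `generate` are hypothetical helper names, not part of the library.

```python
from typing import Any, Dict, List

# Values from the thread; shared by both 10-second and 30-second music models.
MUSIC_PARAMS: Dict[str, Any] = {
    "span_arrangement": "stride1",
    "use_sampling": True,
    "top_p": 0.9,
    "temperature": 3.0,
    "max_cfg_coef": 10.0,
    "min_cfg_coef": 1.0,
}

# Values from the thread for the audio-magnet (sound effect) models.
SOUND_PARAMS: Dict[str, Any] = {
    "span_arrangement": "stride1",
    "use_sampling": True,
    "top_p": 0.8,
    "temperature": 3.5,
    "max_cfg_coef": 20.0,
    "min_cfg_coef": 1.0,
}


def recommended_params(kind: str, duration_secs: int = 10) -> Dict[str, Any]:
    """Return the thread's settings for `kind` ('music' or 'sound')."""
    if kind == "music":
        if duration_secs not in (10, 30):
            raise ValueError("the thread only lists 10- and 30-second music settings")
        first_stage = 20 if duration_secs == 10 else 60
        return {**MUSIC_PARAMS, "decoding_steps": [first_stage, 10, 10, 10]}
    if kind == "sound":
        return {**SOUND_PARAMS, "decoding_steps": [20, 10, 10, 10]}
    raise ValueError(f"unknown kind: {kind!r}")


def generate(model_name: str, descriptions: List[str], params: Dict[str, Any]):
    """Generate audio with the given settings (assumes audiocraft is installed)."""
    # Imported lazily so the helper above works without audiocraft installed.
    from audiocraft.models import MAGNeT

    model = MAGNeT.get_pretrained(model_name)
    model.set_generation_params(**params)  # assumed keyword names; verify locally
    return model.generate(descriptions)
```

A call might look like `generate("facebook/magnet-small-10secs", ["80s synth pop"], recommended_params("music", 10))`, where the checkpoint name and the prompt are illustrative and should be checked against the model card you are using.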