
UST - KMeans Clustering & Acoustic Model(s) Missing


Hello fairseq team!

Problem Summary

I've been trying to reproduce the results of this paper for research, and I believe some necessary model files/URLs are missing from the README.md that accompanies the paper.

In particular, it appears the paper's authors used a novel kmeans quantization model with K = 2500 to perform unit clustering for Hokkien speech, yet the km2500 model (and its associated acoustic model) was never released. This hinders any further tuning, testing, or replication of the English -> Hokkien model provided here. It also makes the Hokkien vocoder that was released non-reusable: the vocoder can only be driven by units from the same kmeans quantization model, since that model defines the quantization space the vocoder was trained on.

The UST branch README does reference the released English kmeans (km1000) clustering model and acoustic model, both of which are necessary to map and cluster speech from Hokkien (or any other language) into the English unit space and then drive the English vocoder. Using them, I have been able to replicate the training/validation and translation of the Taiwanese Across Taiwan (TAT) dataset into English. I can therefore replicate approximately half of the paper at present (Hokkien -> English).

Questions

  1. Will the k = 2500 quantization AND acoustic models from this paper be released to allow complete replication and further testing of the results? Without these models it is impossible to fully verify the authors' work.
  2. If not, will a k = 2500 BASE quantization model be released so that a new kmeans quantizer, acoustic model, and vocoder could be trained in an attempt to replicate the results? The only HuBERT BASE quantization models I can find in the fairseq documentation are KM50, KM100, and KM200, as referenced here. (See the sketch after this list for how such a quantizer might be trained.)
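
For reference, here is roughly how I would expect a new k = 2500 quantizer to be trained with the existing GSLM tooling (examples/textless_nlp/gslm/speech2unit), assuming a base acoustic model with adequate Hokkien coverage were available. The checkpoint name and paths below are hypothetical:

PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py --num_clusters 2500 --feature_type hubert --checkpoint_path ./mhubert_base_hokkien.pt --layer 11 --manifest_path ./TGT_AUDIO/train/train.tsv --out_kmeans_model_path ./km2500.bin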

Code

Using the English km1000 models that WERE released in association with this paper, I was able to properly quantize each split of the provided datasets with the following command:

python -m examples.textless_nlp.gslm.speech2unit.clustering.quantize_with_kmeans --feature_type hubert --kmeans_model_path ./mhubert_base_vp_en_es_fr_it3_L11_km1000.bin --acoustic_model_path ./mhubert_base_vp_en_es_fr_it3.pt --layer 11 --manifest_path ./TGT_AUDIO/test/test.tsv --extension ".wav" --out_quantized_file_path ./TGT_AUDIO/test.txt
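
In case it helps other readers: the --manifest_path argument expects the standard fairseq wav2vec-style manifest, where the first line is the audio root directory and each subsequent line is a tab-separated relative path and sample count. The file names below are illustrative:

/abs/path/to/TGT_AUDIO/test
sample_0001.wav	160000
sample_0002.wav	128000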

And then I could prep the data for training/tuning like so:

python -m examples.speech_to_speech.preprocessing.prep_s2ut_data --reduce-unit --source-dir ./SRC_AUDIO --target-dir ./TGT_AUDIO --data-split dev test train --output-root ./DATA_ROOT --vocoder-checkpoint ./unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur/model.pt --vocoder-cfg ./unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur/config.json
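
For anyone reproducing this, prep_s2ut_data assumes the directory layout described in the speech_to_speech docs, roughly as follows (sample ids are illustrative; paired source/target files share the same id, and target audio should be 16 kHz):

SRC_AUDIO/{train,dev,test}/<sample_id>.wav (source speech)
TGT_AUDIO/{train,dev,test}/<sample_id>.wav (target speech)
TGT_AUDIO/{train,dev,test}.txt (unit sequences from the quantize step above)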

After the above data preparation and a few hours of training/tuning, I was able to successfully map the TAT/Hokkien audio into the English vocoder space and achieve acceptable results.
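
For completeness, my training and unit-to-waveform synthesis followed the standard S2UT recipe from examples/speech_to_speech. Roughly, with hyperparameters abbreviated and $MODEL_DIR / $RESULTS_PATH as placeholders (see the speech_to_speech docs for the full argument set):

fairseq-train ./DATA_ROOT --config-yaml config.yaml --task speech_to_speech --target-is-code --target-code-size 1000 --vocoder code_hifigan --criterion speech_to_unit --label-smoothing 0.2 --arch s2ut_transformer_fisher --share-decoder-input-output-embed --train-subset train --valid-subset dev --save-dir $MODEL_DIR --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 10000 --optimizer adam --clip-norm 10.0 --max-tokens 20000 --update-freq 4 --fp16

python examples/speech_to_speech/generate_waveform_from_code.py --in-code-file $RESULTS_PATH/generate-test.unit --vocoder ./unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur/model.pt --vocoder-cfg ./unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur/config.json --results-path $RESULTS_PATH --dur-prediction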

To replicate the English -> Hokkien S2S portion of the paper, however, the km2500 quantizer and its acoustic model are absolutely needed to perform the two data preparation steps highlighted above.

Any help would be greatly appreciated, such as a simple posting of the URLs to download the km2500 quantizer and acoustic model for further evaluation of the paper.

Thank you!

mdconaway commented on Dec 28, 2023

Excuse me! Have you made any progress on this? I'm still working on building a new speech2unit module.

tarudesu commented on Feb 29, 2024