NLP / GSLM: Did anyone succeed in making K-means clustering model? (S2U)
❓ Questions and Help
What is your question?
Hello, I'm having trouble training a well-converged K-means clustering model for S2U. I have tried training the K-means clustering model on various types of corpora. Since my previous attempts failed, I decided to go back to basics and train exactly as described in the paper:
Quantization. We use k-means to convert continuous frame representations into discrete representation by training on LibriSpeech clean-100h (Panayotov et al., 2015). We experiment with codebooks that have 50, 100, and 200 units.
The published GSLM paper states that quantized units were obtained by training on LibriSpeech clean-100h. So I downloaded the LibriSpeech train-clean-100 corpus, trained a K-means clustering model, and ran re-synthesis.
However, the result was very different: the model I trained produced only babbling and could not re-synthesize the input speech with any similarity. During training I could see that, as the minibatch iterations proceeded, the EWA inertia decreased only by a tiny amount; eventually the run reported convergence due to the lack of improvement in inertia.
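For context, the "ewa inertia" messages look like the verbose output of scikit-learn's MiniBatchKMeans. A minimal sketch that reproduces that kind of log on random stand-in features (illustration only, not the actual fairseq pipeline; the shapes and parameters are made up):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random stand-in for HuBERT layer-6 features; the real ones come from the manifest.
feats = np.random.randn(100_000, 768).astype(np.float32)

km = MiniBatchKMeans(
    n_clusters=50,
    batch_size=10_000,
    max_no_improvement=100,  # stop once the EWA inertia stops improving
    verbose=1,               # prints the per-minibatch "ewa inertia" lines
)
km.fit(feats)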
The input sample, my corrupted output sample (from the model trained as above), and a clean output sample (from the pre-trained model) are linked below.
Input Sample / Output Sample (clean, pre-trained) / Output Sample (corrupted, mine)
Do you have any idea why this is happening? If a GSLM developer could answer my question, it would be extremely helpful. Thanks a lot!
Code
First, I created a manifest file for the speech corpus with the wav2vec_manifest.py script:
python examples/wav2vec/wav2vec_manifest.py ./examples/wav2vec/LibriSpeech/train-clean-100 --dest ./examples/wav2vec/manifest/libri100 --ext flac --valid-percent 0.01
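To rule out a bad manifest, here is a quick sanity check (a throwaway sketch; it only assumes the layout wav2vec_manifest.py writes: the root directory on the first line, then one tab-separated relative path and sample count per line):

import os
import sys

manifest = sys.argv[1]  # e.g. ./examples/wav2vec/manifest/libri100/train.tsv
with open(manifest) as f:
    root = f.readline().strip()
    n_files = missing = 0
    for line in f:
        rel_path, n_samples = line.rstrip("\n").split("\t")
        n_files += 1
        if not os.path.exists(os.path.join(root, rel_path)):
            missing += 1
print(f"{n_files} files listed, {missing} missing under {root}")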
Finally, I trained the K-means clustering model. The command below follows the README in the gslm/speech2unit directory.
N_CLUSTERS=50
TYPE='hubert'
CKPT_PATH=./examples/textless_nlp/gslm/speech2unit/checkpoints/hubert_base_ls960.pt
LAYER=6
MANIFEST=./examples/wav2vec/manifest/libri100/train.tsv
KM_MODEL_PATH=./examples/textless_nlp/gslm/speech2unit/kmeans_saved/hubert50_new.bin
PYTHONPATH=. python examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py \
    --num_clusters $N_CLUSTERS \
    --feature_type $TYPE \
    --checkpoint_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_kmeans_model_path $KM_MODEL_PATH
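My understanding of what the script does internally, as a rough sketch (not the exact fairseq code; hubert_layer6_feats.npy is a hypothetical dump of the extracted features):

import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the HuBERT layer-6 features extracted over the manifest.
feats = np.load("hubert_layer6_feats.npy")  # hypothetical (n_frames, 768) array

km = MiniBatchKMeans(n_clusters=50, batch_size=10_000, max_no_improvement=100)
km.fit(feats)
joblib.dump(km, "hubert50_new.bin")  # the .bin output appears to be a joblib dump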
Then I ran re-synthesis using resynthesize_speech.py. The only difference between the clean and corrupted outputs was $KM_MODEL_PATH.
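To compare the two checkpoints directly, one can quantize the same features with both and inspect the unit sequences (a sketch; the .bin paths and feats.npy are placeholders, and it assumes the checkpoints are joblib-saved sklearn models):

import joblib
import numpy as np

feats = np.load("feats.npy")  # hypothetical (n_frames, 768) HuBERT layer-6 features

km_pretrained = joblib.load("km50_pretrained.bin")  # hypothetical path to the released model
km_mine = joblib.load("hubert50_new.bin")

print("pretrained units:", km_pretrained.predict(feats)[:30])
print("my units:       ", km_mine.predict(feats)[:30])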
What have you tried?
I first thought the corpus was too small, so I downloaded the LibriSpeech 500h set to enlarge the dataset. However, the result was still babbling.
I also tried changing the number of K-means clusters (the codebook size) from 50 to 200, but the result was the same.
What's your environment?
(main environment)
- fairseq Version (e.g., 1.0 or main): main (0.12.2)
- PyTorch Version (e.g., 1.0): 1.12.1
- OS (e.g., Linux): Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-125-generic x86_64)
- How you installed fairseq (pip, source):
git clone https://github.com/pytorch/fairseq
- Build command you used (if compiling from source):
pip install --editable ./
- Python version: 3.9.13
- CUDA/cuDNN version: Build cuda_11.3.r11.3/compiler.29745058_0
- GPU models and configuration: GeForce RTX 3090 (NVIDIA Corporation Device 2204 (rev a1))
- Any other relevant information: -
@nonmetal
Hello, have you solved this problem? The K-means model I trained on 1,000 hours is not working either; the labels it produces are basically all the same.
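One way to quantify that collapse (a sketch, assuming the .bin is a joblib-saved sklearn model and feats.npy is a hypothetical feature dump): check how many clusters the model actually uses.

import joblib
import numpy as np

km = joblib.load("hubert50_new.bin")  # my trained model
feats = np.load("feats.npy")          # hypothetical feature dump

units, counts = np.unique(km.predict(feats), return_counts=True)
print(f"{len(units)} of {km.n_clusters} clusters used")
print("top-5 cluster share:", np.sort(counts)[::-1][:5] / counts.sum())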
Could it be related to the training defaults, beyond the parameters shown in the command? When I run the same cluster_kmeans.py command as above,
I get the following error about a missing argument. Looking into the code at https://github.com/facebookresearch/fairseq/blob/main/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py, the error is legitimate.
Error:
2023-06-08 18:21:04 | INFO | __main__ | Extracting hubert acoustic features...
Traceback (most recent call last):
File "examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py", line 212, in
What is channel_id? Is there a reason it is not included in the arguments when running cluster_kmeans.py (examples/textless_nlp/gslm/speech2unit/clustering/cluster_kmeans.py)?
Thanks!
@nonmetal @lzl1456 Hi, have you solved this problem? I am running into the same issue.
@PrabhjotKaurGosal I met the same problem. It happens because cluster_kmeans.py calls the get_features() function in speech2unit/pretrained/utils.py, which requires the positional argument channel_id. To train the model, I gave it a default value of None, i.e. I changed line https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/examples/textless_nlp/gslm/speech2unit/pretrained/utils.py#L71 to
feature_type, checkpoint_path, layer, manifest_path, sample_pct, flatten, channel_id=None
But the model I obtained still did not work well.
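For clarity, the patched signature in context (a sketch; the function body is elided):

# speech2unit/pretrained/utils.py -- only channel_id's default was added.
def get_features(
    feature_type, checkpoint_path, layer, manifest_path,
    sample_pct, flatten, channel_id=None,
):
    ...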
Do you know if there are any solutions to date?
I also faced this problem. Just passing channel_id=None is fine if you use mono or stereo audio.
https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/examples/textless_nlp/gslm/speech2unit/pretrained/hubert_feature_reader.py#L34C46-L34C56
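Illustratively, this is how such a channel_id is typically applied when loading audio (not the actual reader code, just the idea):

import torchaudio

wav, sr = torchaudio.load("sample.flac")  # wav shape: (channels, num_samples)

channel_id = None
if channel_id is None:
    wav = wav.mean(dim=0)  # single-channel audio is unchanged; multi-channel is downmixed
else:
    wav = wav[channel_id]  # pick the requested channel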
Did anyone succeed in making the K-means clustering model? I am facing the same issue: the K-means model I train does not give good results.