soundstream-pytorch How to train a new set of data?

Thanks for your code. I want to learn how to use your model to train on a new dataset. Could you provide a train.py file?

Nov 02 '23 08:11 a897456

Do you mean you want to train the SoundStream model on new training data, or train another model that uses the output of SoundStream as features? In the first case, you can run python soundstream.py, which should download LIBRISPEECH under ./data and start training.

Nov 02 '23 14:11 kaiidams

Thank you for your reply. First, I found that your SoundStream model needs to download data, including YESNO, LIBRISPEECH or librispeech, which is very time-consuming, so I downloaded other data in advance. Second, I mean the first case: I want to use your SoundStream model to train on a new dataset with a sample rate of 8 kHz, which I have already downloaded, but I don't know how to load it into your model.

Nov 03 '23 01:11 a897456

My ultimate goal is to achieve low bit rate compression. I would like to train your model on data with a sample rate of 8 kHz, then change num_embeddings from 1024 to 256 and num_quantizers from 8 to 6, and see what the end result is.

Nov 03 '23 01:11 a897456

First Then

Nov 03 '23 01:11 a897456

ds is not a directory path string, but a torch.utils.data.Dataset. If you want to train an 8 kHz model with LIBRISPEECH, you can change sample_rate. If you want to use your own custom dataset, you can implement your own Dataset, which should not be too difficult.
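As a rough illustration of "implement your own Dataset", here is a minimal sketch of a map-style dataset (an object with `__len__` and `__getitem__`) that serves fixed-length segments from a folder of WAV files. It uses only the standard library so that it runs on its own. The class name `WavFolderDataset`, the folder layout, and the returned format (a list of floats in [-1, 1)) are all assumptions, not what soundstream.py expects. For real training you would probably subclass torch.utils.data.Dataset and return a tensor in whatever shape the training code consumes.

```python
import random
import struct
import wave
from pathlib import Path


class WavFolderDataset:
    """Hypothetical map-style dataset over mono 16-bit PCM WAV files."""

    def __init__(self, root, sample_rate=8000, segment_length=32270):
        self.paths = sorted(Path(root).glob("*.wav"))
        self.sample_rate = sample_rate
        self.segment_length = segment_length

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        with wave.open(str(self.paths[index]), "rb") as f:
            if f.getframerate() != self.sample_rate:
                raise ValueError("unexpected sample rate")
            if f.getnchannels() != 1 or f.getsampwidth() != 2:
                raise ValueError("expected mono 16-bit PCM")
            raw = f.readframes(f.getnframes())
        samples = struct.unpack("<%dh" % (len(raw) // 2), raw)

        # Random crop for long files, zero padding for short ones.
        n = self.segment_length
        if len(samples) >= n:
            start = random.randint(0, len(samples) - n)
            samples = samples[start:start + n]
        else:
            samples = samples + (0,) * (n - len(samples))

        # Scale 16-bit integers to floats in [-1, 1).
        return [s / 32768.0 for s in samples]
```

An instance such as `WavFolderDataset("/path/to/wavs", sample_rate=8000)` could then stand in for the dataset object, assuming the surrounding training code is adapted to the returned format.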

Nov 03 '23 03:11 kaiidams

Excuse me again, I have successfully started training, and the training data is the same as yours. The difference is that my data was downloaded in advance. During the training process, when the epoch was 98, an inexplicable error occurred, which seemed to be a problem with the data. However, the data was the same as yours, so I don't understand why this error occurred. Have you encountered it before?

Nov 06 '23 01:11 a897456

I tried to resume training from epoch 98 and found no errors, so I consider this issue avoided for now. As for the second question, I could not find a testing procedure for the final model. Can you provide guidance?

Nov 06 '23 09:11 a897456

If you just want to hear the output yourself, you can encode the audio file by calling the forward() method. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L500 . If you want to compute ViSQOL, sorry, there is no implementation for that.

Nov 06 '23 12:11 kaiidams

First, how do you determine whether your model is useful? What are the evaluation metrics? Second, how do I use an output such as "epoch=84-step=150000.ckpt" to check the usability of the model?

Nov 06 '23 13:11 a897456

You can listen to reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint. You can load the model as described at https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html .

 model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")

Nov 06 '23 13:11 kaiidams

First, I think I have completed 150 training epochs, as shown in the picture. Second, regarding the "model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")" you mentioned, can I substitute the .ckpt file labeled in the second image to reconstruct my speech signal, in order to verify the usefulness of the model? Third, is the .ckpt file labeled in the second image the final trained model? I'm sorry, I'm a novice, so I may be bothering you with many naive questions.

Nov 07 '23 01:11 a897456

First: I think I may have completed the reconstruction of the voice signal. I followed your method and completed the main function. Second: What do you think the PESQ score of the output file is? The input and output files are located below.

Nov 07 '23 09:11 a897456

https://drive.google.com/drive/folders/1mvyg_CRxI6LGlVXbYu0OHKmhiAV_E-bK?usp=drive_link

Nov 07 '23 09:11 a897456

Firstly, I'm sorry, please forgive me for being a novice. The audio files were not accessible earlier; you can access them now. Secondly, a few days ago an unknown error occurred during training at epoch 84. The model saved at that point and the model saved after 150 epochs of training produce significantly different output files. Is the number of epochs too large?

Nov 07 '23 13:11 a897456

SoundStream has a couple of loss functions. You can use TensorBoard to look at these losses. If some of them have strange behavior you may adjust parameters.

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L175C1-L180C1

I used the same strides as the original paper for the 16 kHz model: 2 * 4 * 5 * 8 = 320 samples per window. This makes 50 Hz embeddings. The original SoundStream is for 24 kHz, which makes 75 Hz embeddings. So an 8 kHz model with the same strides has to compress a longer audio window (40 ms instead of 20 ms) into each embedding.
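The relationship between strides, window size, and embedding rate can be checked with a few lines of arithmetic. This is only a sketch of the calculation described above; the helper name `frame_stats` is made up for illustration and does not appear in soundstream.py.

```python
from math import prod


def frame_stats(sample_rate, strides):
    """Return (hop in samples, embeddings per second, window in ms)."""
    hop = prod(strides)
    return hop, sample_rate / hop, 1000.0 * hop / sample_rate


# Same strides at the three sample rates discussed in this thread.
for sample_rate in (24000, 16000, 8000):
    hop, rate_hz, window_ms = frame_stats(sample_rate, (2, 4, 5, 8))
    print(f"{sample_rate} Hz: hop={hop}, {rate_hz:g} Hz embeddings, {window_ms:.2f} ms")

# A smaller stride product brings an 8 kHz model back to 50 Hz embeddings.
print(frame_stats(8000, (2, 2, 5, 8)))
```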

Nov 07 '23 14:11 kaiidams

First, I studied TensorBoard for several hours today, but I haven't made any progress yet. My understanding is this: TensorBoard is normally used when training with the model.fit() function, but for now I train with the pl.Trainer.fit() function. Do I need to change the training function if I want to use TensorBoard? How should I use TensorBoard? Second, you mentioned that "So 8kHz model has to compress a longer audio window into an embedding". I want to change the window size to 2 * 2 * 5 * 8 = 160. Am I understanding this correctly?

Nov 08 '23 10:11 a897456

It seems that TensorBoard is not enabled by default. If you enable it, you'll find lightning_logs/version_X/event.xxxx.yyyy in your output. You can launch it with tensorboard --logdir lightning_logs/version_X/

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L663C26-L663C26

Alternatively, you can look at the CSV file lightning_logs/version_X/metrics.csv.
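If TensorBoard is not available, the CSV file can be inspected directly. The sketch below collects the most recent value of every column using only the standard library. It assumes the usual layout written by Lightning's CSV logger (one column per logged metric, with empty cells where a metric was not logged at that step). The helper name `latest_metrics` is hypothetical, and the actual column names depend on what soundstream.py logs.

```python
import csv


def latest_metrics(csv_path):
    """Return the last non-empty numeric value seen in each column."""
    latest = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for name, value in row.items():
                if name is None or value in ("", None):
                    continue  # metric not logged on this row
                try:
                    latest[name] = float(value)
                except ValueError:
                    pass  # skip non-numeric cells
    return latest
```

Calling `latest_metrics("lightning_logs/version_X/metrics.csv")` with the real version directory substituted would then give a quick snapshot of where each loss ended up.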

2 * 2 * 5 * 8 = 160 makes the window size 160. I think it is a good number.

Nov 08 '23 10:11 kaiidams

First, after seeing your reply, I spent several hours trying to open TensorBoard and found that just setting logger=True is enough. I'm so happy. Second, suppose I want a low bit rate codec, for example 1.2 kbps at an 8 kHz sampling rate. With a window size of 320, each frame gets 1200 * 320 / 8000 = 48 bits, which I can quantize with six 8-bit codebooks. If I keep the six 8-bit codebooks and want 0.6 kbps, I need 600 * 640 / 8000 = 48 bits, which means the window size changes from 320 to 640. So it seems I need to increase the window size. Do you agree?
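The arithmetic in this question can be written as two small helpers: one for the bit budget per frame at a given bit rate and window size, and one for the window size needed to hit a bit rate with a fixed bit budget. The function names are made up for illustration.

```python
def bits_per_frame(bitrate_bps, hop, sample_rate):
    """Bits available for each frame of `hop` samples."""
    return bitrate_bps * hop / sample_rate


def hop_for_bitrate(bitrate_bps, frame_bits, sample_rate):
    """Window size in samples needed to spend `frame_bits` per frame."""
    return frame_bits * sample_rate / bitrate_bps


# 1.2 kbps at 8 kHz with a 320-sample window: six 8-bit codebooks.
print(bits_per_frame(1200, 320, 8000))

# Keeping 6 * 8 = 48 bits per frame at 0.6 kbps doubles the window.
print(hop_for_bitrate(600, 6 * 8, 8000))
```

Note that a 640-sample window at 8 kHz covers 80 ms of audio per frame.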

Nov 08 '23 13:11 a897456

If you just want to achieve low bit rate compression, you can just reduce the number of quantizers, without retraining. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241 Many of the latest neural vocoders (SoundStream and Meta EnCodec, https://github.com/facebookresearch/encodec) adopt a hierarchical quantized autoencoder so that they can achieve adjustable bit rates. However, note that dropping quantizers doesn't reduce computational cost.

I think a longer window size is difficult to learn, as an audio signal is stationary over a short time, but not over a longer time.

Nov 09 '23 12:11 kaiidams

First, I studied traditional speech compression before, which uses codebooks for quantization. For example, 48 bits are used to quantize a feature vector, distributed like this: there are 6 codebooks, and each codebook has 8 bits, that is, 2^8 = 256 representative vectors. But I may have misunderstood SoundStream, because I always thought num_quantizers corresponded to num_codebook in the traditional compression algorithm and num_embeddings corresponded to dim_codebook. I should treat embedding_dim in SoundStream as dim_codebook, right?

Second, in your model, num_quantizers=6; num_embeddings=1024; embedding_dim=512. How do I calculate the compression bit rate in kbps? I want to know what the bit rate is. Can you show me?

Third, I've been following EnCodec, but I haven't started learning it yet. You say SoundStream also adopts a hierarchical quantized autoencoder. Is that implemented in your code? If you have finished that part, can you show me?

Fourth, your reminder is right: the window size should not be greater than 320, since speech is stationary for only about 20-30 ms. But I would like to ask whether the multi-frame joint quantization idea from traditional compression algorithms can be used in SoundStream.

Nov 09 '23 12:11 a897456

    [self.register_buffer("code_count", torch.empty(num_quantizers, num_embeddings))](https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L225)

For the second question I asked yesterday, I have a new idea, though I do not know if it is correct. The number of bits per frame is num_quantizers * log2(num_embeddings) = 80 bits, and the bit rate is 80 bits * (16000 Hz / 320 = 50 frames per second) = 4 kbps. Am I right?

Nov 10 '23 07:11 a897456

First, I studied traditional speech compression before, using codebooks to quantify, such as 48bit to quantify a feature

Yes, you are right, num_quantizers is the number of codebooks. SoundStream has 8 codebooks and each codebook has 1024 codes, so one frame is encoded with 8 * log2(1024) = 80 bits. In the original paper, the frame rate is 75 Hz for a 24 kHz sampling rate, which gives 75 * 80 = 6 kbps. For 16 kHz it is 50 * 80 = 4 kbps.
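The bit rate calculation above fits in one function. This is a sketch of the formula only; `bitrate_bps` is an illustrative name, and the hop of 320 samples is the stride product mentioned earlier in the thread.

```python
from math import log2


def bitrate_bps(num_quantizers, num_embeddings, sample_rate, hop):
    """Bits per frame times frames per second."""
    bits_per_frame = num_quantizers * log2(num_embeddings)
    frames_per_second = sample_rate / hop
    return bits_per_frame * frames_per_second


print(bitrate_bps(8, 1024, 24000, 320))  # original paper setting
print(bitrate_bps(8, 1024, 16000, 320))  # 16 kHz model in this repository
print(bitrate_bps(6, 256, 8000, 320))    # setting proposed by the asker
```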

Third, I've followed Encodec, but I haven't started learning it yet. You say soundstream also adopt hierarchical quantized autoencoder. Is that mentioned in your code? If you have finished that part , can you show me?

The algorithm is explained in Algorithm 1: Residual Vector Quantization of https://arxiv.org/pdf/2107.03312.pdf. This produces 8 codes of 10 bits each, in which the first code is the most important and the last is the least important. You can approximately reproduce the original vector using only some of the codes. For example, with the first 5 codes you can achieve 5 * 10 * 50 = 2.5 kbps.

At inference time, you can pass n codes, where n is between 1 and 8. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L264

During training, it randomly drops the less important codes so that the model can reproduce audio with only the important ones. https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241
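To make the idea concrete, here is a toy residual vector quantizer in plain Python. It is not the code from soundstream.py: the codebooks are random rather than learned, the vectors are tiny, and every function name is made up. Each toy codebook also contains a zero vector, which is only a convenience to guarantee that adding a stage never increases the error. The point is just to show that each stage quantizes what the previous stages left over, so decoding with the first n codes gives a coarser reconstruction.

```python
import random


def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def rvq_encode(x, codebooks):
    """Quantize x stage by stage, each stage coding the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entries; passing fewer codes uses fewer stages."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def make_toy_codebooks(num_quantizers=8, num_embeddings=16, dim=4, seed=0):
    """Random codebooks whose scale halves at every stage (toy data only)."""
    rng = random.Random(seed)
    books = []
    for q in range(num_quantizers):
        scale = 0.5 ** q
        book = [[0.0] * dim]  # zero entry: a stage can always "do nothing"
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(num_embeddings - 1)]
        books.append(book)
    return books


def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


codebooks = make_toy_codebooks()
x = [0.3, -0.2, 0.5, 0.1]
codes = rvq_encode(x, codebooks)
for n in (1, 4, 8):
    print(n, squared_error(x, rvq_decode(codes[:n], codebooks)))
```

In this toy setup, truncating `codes` to its first n entries plays the role of passing n codes at inference time.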

Nov 10 '23 10:11 kaiidams

Thank you for your reply. Your reply is my motivation to continue studying. My understanding is this: I just load your pre-trained model (soundstream_16khz-20230425.ckpt) and change the value of n to get different bit rates without retraining. For example, n = 4 gives 4 * 10 * 50 = 2 kbps, and n = 2 gives 2 * 10 * 50 = 1 kbps.

Here, you can pass n codes, where n is between 1 and 8 in the inference time.

I want to achieve a lower speech compression bit rate by changing the sampling rate to 8 kHz (which should be the lowest) and changing the step size to 2 * 4 * 5 * 6 = 240 to match the 8 kHz sample rate. 'num_quantizers=8' and 'num_embeddings=1024' remain unchanged, with epoch=200. Then I will compare the results with your 16 kHz model, changing 'n' in step for both.

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L264

In the training time, it drops less important codes randomly so that it can reproduce audio with only important codes.

Can n be set to 1, so that only the most important code is kept?

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L241

Nov 10 '23 11:11 a897456

When I change the step size to 2 * 4 * 5 * 6 = 240, I need to change segment_length from 32270 to 30430, which I calculated so that it divides by the step size exactly. So I would like to ask whether the 32270 you chose was also meant to divide by the step size. Can I make it bigger or smaller? I wonder if it could be bigger, because then it could include more X content. Am I right?

Nov 11 '23 02:11 a897456

I am a PhD student and I want to publish a paper based on SoundStream, but I haven't found a novel contribution yet. Can you give me some guidance about SoundStream? For example, where can I continue to improve SoundStream?

At first I wanted to use SoundStream to achieve lower bit rates, but I found that SoundStream already supports this by changing 'n', or by retraining with new 'num_quantizers' and 'num_embeddings', so I couldn't find a new idea. Can you suggest something? Thanks.

Nov 11 '23 06:11 a897456

In the paper, the authors mention that as long as the bit rate is kept the same, different step sizes do not affect the final score.

So my idea of retraining a new model to achieve a lower bit rate by changing the step size to 2 * 4 * 5 * 6 = 240 might not work.

Nov 11 '23 10:11 a897456

Your reply is my motivation to continue studying.

Thank you! I'm glad to hear that.

I need to change the segment_length from 32270 to 30430,

32270 is a nice number because it makes the output length of the decoder the same as the input length of the encoder. They are sometimes different because of rounding. I think 30430 is a good number for 2 * 4 * 5 * 6.

For example, where can I continue to improve soundstream?

I'm not sure, but you may try a variable rate. SoundStream has a fixed rate in time. It might be enough when the audio signal is not so complicated.

BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.

Nov 12 '23 16:11 kaiidams

https://github.com/aliutkus/speechmetrics I tested it with PESQ today and found that PESQ didn't work very well. Did you not use ViSQOL or PESQ test tools at that time?

Nov 13 '23 11:11 a897456

For example, where can I continue to improve soundstream?

I'm not sure, but you may try a variable rate. SoundStream has a fixed rate in time. It might be enough when the audio signal is not so complicated.

Can you be more specific? Because I am a beginner in audio compression and my research direction is very low bit rate compression, I feel that you are an expert in this field, so I would like to hear your specific opinion.

Nov 13 '23 12:11 a897456

BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.

I have already been following two models, SoundStream and EnCodec, which are very close to my research direction. So my plan is this: for my first paper, I want to do some research based on SoundStream, but I haven't found a suitable research topic yet. For the second paper, I want to do some research based on EnCodec. So I have been studying SoundStream recently and will start studying EnCodec after the New Year.

Nov 13 '23 12:11 a897456