Amphion
Feature Alignment in SVC dataset
I am trying to use latent features from Encodec as the condition for the SVC diffusion network. However, I encountered some problems when aligning the length of the Encodec feature sequence to the length of the Mel spectrogram. Specifically, I tried to call the `offline_align()` function in `__getitem__()` of `SVCDataset`, but I am not sure how to calculate `source_hop`:

```python
source_hop = (
    self.cfg.preprocess.whisper_frameshift
    * self.cfg.preprocess.whisper_downsample_rate
    * self.cfg.preprocess.sample_rate
)
```

So my questions are:

- Where do `source_hop` and `target_hop` come from? I am not sure whether neural codecs like Encodec or SpeechTokenizer have a "frameshift". How should I calculate `source_hop` in this case?
- It is said that the frameshift of the content features and the Mel spectrogram should not differ much. Considering this, is it still reasonable to use Encodec features as the condition? (The strides in the Encodec encoder are [2, 4, 5, 8], so I suppose the downsample rate is 320.)
Thanks for such a valuable question! Using Encodec as the latent feature is also part of our future research work. We appreciate that you use Amphion as your codebase!
@Adorable-Qin Zihao, would you follow up on the question about the frameshift? @VocodexElysium Yicheng, please help provide some background knowledge about Encodec's frameshift.
Yes, the equivalent hop size for Encodec can be considered to be 320. So if you want to use the latent representation from the official Encodec checkpoint that Meta released, you should align the content features with that hop size (320).
The source hop size for Whisper (which is 240) or other content features is also around this value (320), so I think it is reasonable to use Encodec features as the condition.
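As a sanity check, the equivalent hop follows directly from the product of the encoder strides, and the length alignment can then be sketched as a simple repeat/index resampling. Note that `align_length` below is a hypothetical illustration of the idea, not Amphion's actual `offline_align()` implementation:

```python
import numpy as np

# Equivalent hop size of Encodec: the product of the encoder strides.
encodec_strides = [2, 4, 5, 8]
source_hop = int(np.prod(encodec_strides))  # 320 waveform samples per latent frame

def align_length(feature: np.ndarray, target_len: int) -> np.ndarray:
    """Repeat/drop frames of `feature` (T, D) along time to reach `target_len` frames.

    Hypothetical sketch of hop-based alignment, not Amphion's offline_align().
    """
    src_len = feature.shape[0]
    idx = np.minimum(
        np.floor(np.arange(target_len) * src_len / target_len).astype(int),
        src_len - 1,
    )
    return feature[idx]

sr = 16000
target_hop = 320                       # BigVGAN hop size, as discussed below
wav_len = 5 * sr                       # a 5-second clip
n_mel_frames = wav_len // target_hop   # 250 Mel frames
encodec_frames = np.random.randn(wav_len // source_hop, 128)  # dummy latents
aligned = align_length(encodec_frames, n_mel_frames)
assert aligned.shape == (n_mel_frames, 128)
```

Since both hops are 320 here, the alignment is an identity mapping; for a content feature with a different hop (e.g. Whisper's 240), the same indexing would repeat or drop frames as needed.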
Hi @Ching-Yee-Chan!

The `source_hop` comes from the pre-trained acoustic models used to extract the content features. For example, if you feed the ContentVec model a 16 kHz waveform of 5 seconds, the output feature will contain 250 frames, since the `label_rate` of ContentVec is 50.
Likewise, assuming the `label_rate` of the speech tokenizer you are using is $x$, you would have $\text{source hop} = \frac{\text{sampling rate}}{x}$, where `label_rate` is the number of output frames per second, i.e. how many frames the model outputs for a waveform of 1 second.
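Under that formula, computing the hop for a given content model is a single division. A minimal sketch (the helper name is ours; the ContentVec numbers match the example above):

```python
def source_hop_from_label_rate(sample_rate: int, label_rate: int) -> int:
    # label_rate = output frames per second of the content model,
    # so each frame covers sample_rate / label_rate waveform samples.
    assert sample_rate % label_rate == 0, "hop should be an integer"
    return sample_rate // label_rate

# ContentVec: 16 kHz input, label_rate of 50 -> hop of 320 samples,
# so a 5-second waveform yields 250 frames.
hop = source_hop_from_label_rate(16000, 50)
print(hop)               # 320
print(5 * 16000 // hop)  # 250
```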
The `target_hop` here refers to the hop size of our vocoder, BigVGAN, which is 320. To maximize performance, the hop size of the content features you use should not differ much from this.
Hi @Ching-Yee-Chan, if you have any further questions about EnCodec or feature alignment, feel free to re-open this issue!