Amphion
Feature Alignment in SVC dataset
I am trying to use latent features from Encodec as the condition for the SVC diffusion network. However, I encountered some problems when aligning the length of the Encodec feature sequence to the length of the Mel spectrogram. Specifically, I tried to call the `offline_align()` function in `__getitem__()` of `SVCDataset`, but I am not sure how to calculate `source_hop`:

```python
source_hop = (
    self.cfg.preprocess.whisper_frameshift
    * self.cfg.preprocess.whisper_downsample_rate
    * self.cfg.preprocess.sample_rate
)
```

So my questions are:

- Where do `source_hop` and `target_hop` come from? I am not sure whether neural codecs like Encodec or SpeechTokenizer have a "frameshift". How should I calculate `source_hop` in this case?
- It is said that the frameshift of the content features and the Mel spectrogram should not differ much. Considering this, is it still reasonable to use Encodec features as the condition? (The strides in the Encodec encoder are [2, 4, 5, 8], so I suppose the downsample rate is 320.)
Thanks for such a valuable question! Using Encodec as the latent feature is also part of our future research work. We appreciate that you use Amphion as your codebase!
@Adorable-Qin Zihao, would you follow up on the question about the frameshift? @VocodexElysium Yicheng, please help provide some background knowledge about Encodec's frameshift.
Yes, the equivalent hop size for Encodec can be considered to be 320. So if you want to use the latent representation from the official Encodec checkpoint that Meta released, you should align the content features with that hop size (320).
The source hop size for Whisper (which is 240) or other content features is also around this value (320), so I think it is reasonable to use Encodec features as the condition.
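As a sanity check, the equivalent hop follows directly from the product of the encoder strides, and the length alignment can then be sketched as a simple repeat/index resampling. Note that `align_length` below is a hypothetical illustration of the idea, not Amphion's actual `offline_align()` implementation:

```python
import numpy as np

# Equivalent hop size of Encodec: the product of the encoder strides.
encodec_strides = [2, 4, 5, 8]
source_hop = int(np.prod(encodec_strides))  # 320 waveform samples per latent frame

def align_length(feature: np.ndarray, target_len: int) -> np.ndarray:
    """Repeat/drop frames of `feature` (T, D) along time to reach `target_len` frames.

    Hypothetical sketch of hop-based alignment, not Amphion's offline_align().
    """
    src_len = feature.shape[0]
    idx = np.minimum(
        np.floor(np.arange(target_len) * src_len / target_len).astype(int),
        src_len - 1,
    )
    return feature[idx]

sr = 16000
target_hop = 320                       # BigVGAN hop size, as discussed below
wav_len = 5 * sr                       # a 5-second clip
n_mel_frames = wav_len // target_hop   # 250 Mel frames
encodec_frames = np.random.randn(wav_len // source_hop, 128)  # dummy latents
aligned = align_length(encodec_frames, n_mel_frames)
assert aligned.shape == (n_mel_frames, 128)
```

Since both hops are 320 here, the alignment is an identity mapping; for a content feature with a different hop (e.g. Whisper's 240), the same indexing would repeat or drop frames as needed.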
Hi @Ching-Yee-Chan!

The `source_hop` comes from the pre-trained acoustic models used to extract the content features. For example, if you feed the ContentVec model a 16 kHz waveform of 5 seconds, the output feature will contain 250 frames, since the `label_rate` of ContentVec is 50.
Likewise, assuming the `label_rate` of the speech tokenizer you are using is $x$, you would have $\text{source hop} = \frac{\text{sampling rate}}{x}$, where `label_rate` is the number of output frames per second, i.e. how many frames the model outputs for a waveform of 1 second.
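Under that formula, computing the hop for a given content model is a single division. A minimal sketch (the helper name is ours; the ContentVec numbers match the example above):

```python
def source_hop_from_label_rate(sample_rate: int, label_rate: int) -> int:
    # label_rate = output frames per second of the content model,
    # so each frame covers sample_rate / label_rate waveform samples.
    assert sample_rate % label_rate == 0, "hop should be an integer"
    return sample_rate // label_rate

# ContentVec: 16 kHz input, label_rate of 50 -> hop of 320 samples,
# so a 5-second waveform yields 250 frames.
hop = source_hop_from_label_rate(16000, 50)
print(hop)               # 320
print(5 * 16000 // hop)  # 250
```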
The `target_hop` here refers to the hop size of our vocoder, BigVGAN, which is 320. To maximize performance, the hop size of the content features you use should not differ much from this.
Hi @Ching-Yee-Chan, if you have any further questions about EnCodec or feature alignment, feel free to re-open this issue!