Amphion
[Help]: Requesting some guidance / documentation on choosing appropriate parameters for mssbcqt
Hello! I'd be very glad to get some more information on how to adapt the mssbcqt discriminator for 48 kHz audio.
Lately I've been trying to improve the current architecture of RVC (Retrieval-based Voice Conversion) by adopting the ms-sb-cqt and ms-stft discriminators. However, from what I can see, they were tested on 24 kHz audio (and the config is presumably meant for that rate). Essentially, I'd like some guidance on how to properly choose the CQT params:
filters=32,
max_filters=1024,
filters_scale=1,
dilations=[1, 2, 4],
in_channels=1,
out_channels=1,
hop_lengths= [512, 256, 256],
n_octaves=[9, 9, 9],
bins_per_octaves=[24, 36, 48],
For more details, this is the current config I use for training pretrained models for RVC:
},
"data": {
"max_wav_value": 32768.0,
"sampling_rate": 48000,
"filter_length": 2048,
"hop_length": 480,
"win_length": 2048,
"n_mel_channels": 128,
"mel_fmin": 0.0,
"mel_fmax": null
},
"model": {
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [12,10,2,2],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [24,20,4,4],
"use_spectral_norm": false,
"gin_channels": 256,
"spk_embed_dim": 109
}
}
An important note: I intend to pair the mssbcqt / msstft combo with the existing MultiPeriodDiscriminator used in RVC. Thank you in advance!
Bumping up.
@VocodexElysium Hmm.. So then? Any kind of feedback I can count on? (Pardon the @, but it's been 2 weeks.)
Bump. Still waiting patiently ~
@codename0og Thanks for your patience! It seems the author, @VocodexElysium, hasn't responded yet. Some suggestions from me:
- Look at the https://github.com/NVIDIA/BigVGAN configurations, since they have a 44k config
- Check the long paper: https://arxiv.org/abs/2404.17161
Thanks!
I would like to piggyback on this question to ask about the upsampling in the CQT discriminators. I couldn't find anything about it in the paper.
https://github.com/open-mmlab/Amphion/blob/f25ba323234031d6310bcbecbed57bb09fe5fb45/models/vocoders/gan/discriminator/mssbcqtd.py#L39-L41
and
https://github.com/open-mmlab/Amphion/blob/f25ba323234031d6310bcbecbed57bb09fe5fb45/models/vocoders/gan/discriminator/mssbcqtd.py#L110-L115
Why is this necessary? Does anyone have insights into it?
EDIT: I have figured it out. Without resampling, the CQT analysis parameters result in a filterbank whose highest center frequencies exceed the Nyquist frequency (half the sample rate).
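For anyone else landing here, the check behind that conclusion can be sketched in a few lines. This is only a rough illustration, not code from the repo: it assumes the lowest CQT bin sits at `fmin = 32.7` Hz (an assumption on my part, not something stated in this thread) and that bins are geometrically spaced, so bin `k` is centered at `fmin * 2**(k / bins_per_octave)`. The helper names are made up for this example.

```python
# Hypothetical helpers, not part of Amphion: check whether a CQT filterbank
# stays below the Nyquist frequency for a given sample rate.

ASSUMED_FMIN_HZ = 32.7  # assumption: lowest CQT center frequency (roughly C1)


def cqt_top_frequency(fmin, n_octaves, bins_per_octave):
    """Center frequency of the highest bin of a geometrically spaced CQT."""
    n_bins = n_octaves * bins_per_octave
    return fmin * 2.0 ** ((n_bins - 1) / bins_per_octave)


def fits_below_nyquist(sample_rate, n_octaves, bins_per_octave,
                       fmin=ASSUMED_FMIN_HZ):
    """True if the highest center frequency is below sample_rate / 2."""
    top = cqt_top_frequency(fmin, n_octaves, bins_per_octave)
    return top < sample_rate / 2


if __name__ == "__main__":
    # Settings quoted earlier in this thread: 9 octaves, 24/36/48 bins each.
    for bins in (24, 36, 48):
        top = cqt_top_frequency(ASSUMED_FMIN_HZ, 9, bins)
        for sr in (24000, 48000):
            ok = fits_below_nyquist(sr, 9, bins)
            print(f"bins_per_octave={bins} sr={sr}: "
                  f"top={top:.0f} Hz, nyquist={sr / 2:.0f} Hz, fits={ok}")
```

Under these assumptions, 9 octaves reach roughly 16.3 to 16.5 kHz, which is above the 12 kHz Nyquist frequency of 24 kHz audio but below the 24 kHz Nyquist frequency after resampling to 48 kHz. The same function can be used to try other `n_octaves` values against a 48 kHz input, though whether the discriminator itself should be changed that way is a separate question I can't answer.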