Amphion
[BUG]: ns2_dataset.py does not have these two parts, phones and num_frames, which are needed in ns2_trainer.py
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121 https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L131 https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_trainer.py#L269
These two elements are not included in train.json, which ns2_trainer.py uses.
I am also facing the same problem. You can work around it temporarily by replacing this line: https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121
with
    with open(os.path.join(self.phone_dir, uid + ".phone"), "r") as f:
        self.utt2phone[utt] = f.read().strip()
while setting
    self.phone_dir = os.path.join(processed_data_dir, 'phones')
in the __init__ of NS2Dataset.
You can just comment out the parts that use frame counts, because they are only used for dynamic batching. If you do, also set "use_dynamic_batchsize": false in exp_config.json.
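For anyone trying this workaround, here is a minimal standalone sketch of the same file-reading idea, outside the dataset class. The helper name load_phones is hypothetical, and the sketch assumes one <uid>.phone file per utterance in the phones directory, as in the snippet above. For simplicity it keys the result by uid, whereas the snippet above keys self.utt2phone by utt.

```python
import os


def load_phones(phone_dir, uids):
    """Read one phone sequence per utterance from <phone_dir>/<uid>.phone.

    Hypothetical helper: the directory layout and the ".phone" suffix follow
    the workaround above and are not confirmed as the repo's official layout.
    """
    utt2phone = {}
    for uid in uids:
        path = os.path.join(phone_dir, uid + ".phone")
        with open(path, "r", encoding="utf-8") as f:
            # Strip the trailing newline so the sequence is a clean string.
            utt2phone[uid] = f.read().strip()
    return utt2phone
```

A missing file raises FileNotFoundError here, which is probably what you want while debugging preprocessing.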
Hi, you need to generate the phone sequence and record the number of frames for each sample.
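A rough sketch of what that could look like when preparing the metadata, assuming each entry is a dict with a "Uid" key and that you already have the phone strings and the sample counts per utterance. The function name add_missing_fields, the "Uid" key, and the default hop_size of 200 are assumptions for illustration. Only the field names "phones" and "num_frames" come from this issue.

```python
import json


def add_missing_fields(metadata, phones_by_uid, samples_by_uid, hop_size=200):
    """Return a copy of the metadata with "phones" and "num_frames" added.

    Hypothetical helper: hop_size is the assumed codec downsampling rate.
    """
    enriched = []
    for entry in metadata:
        uid = entry["Uid"]  # assumed key name for the utterance id
        new_entry = dict(entry)  # copy so the input list is left untouched
        new_entry["phones"] = phones_by_uid[uid]
        # One frame per hop_size audio samples (floor division).
        new_entry["num_frames"] = samples_by_uid[uid] // hop_size
        enriched.append(new_entry)
    return enriched


metadata = [{"Dataset": "demo", "Uid": "utt_0001"}]
enriched = add_missing_fields(
    metadata,
    phones_by_uid={"utt_0001": "HH AH0 L OW1"},
    samples_by_uid={"utt_0001": 48000},
)
print(json.dumps(enriched, indent=2))
```

The enriched list could then be written back with json.dump in place of the original train.json. Keep a backup of the original file first.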
Does the number of frames mean the number of phones in the phone sequence?
Hi @shreeshailgan, according to the NS2 paper, "As shown in Figure 2, our neural audio codec consists of an audio encoder, a residual vector-quantizer (RVQ), and an audio decoder: 1) The audio encoder consists of several convolutional blocks with a total downsampling rate of 200 for 16KHz audio, i.e., each frame corresponds to a 12.5ms speech segment." You can refer to https://arxiv.org/pdf/2304.09116.pdf for more details.
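Reading the quoted passage, a frame appears to be a codec frame covering 200 audio samples, not a phone. The arithmetic can be sketched as below. The function names are hypothetical, and the floor division is an assumption: the real codec may pad and produce one frame more.

```python
def frame_duration_ms(sample_rate=16000, hop_size=200):
    """Length of one codec frame in milliseconds."""
    return 1000.0 * hop_size / sample_rate


def frames_for_duration(seconds, sample_rate=16000, hop_size=200):
    """Approximate number of codec frames for an utterance of given length."""
    num_samples = int(seconds * sample_rate)
    # Assumes no padding: one frame per full hop_size samples.
    return num_samples // hop_size


print(frame_duration_ms())       # 200 / 16000 s = 12.5 ms, matching the paper
print(frames_for_duration(3.0))  # 48000 samples / 200 = 240 frames
```

So a 3-second clip at 16 kHz has about 240 frames, regardless of how many phones its transcript contains.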