
[BUG]: ns2_dataset.py expects two fields, phones and num_frames, that are missing from train.json but needed by ns2_trainer.py

Open a897456 opened this issue 10 months ago • 4 comments

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L131
https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_trainer.py#L269

These two elements (phones and num_frames) are not included in train.json, yet they are read by ns2_dataset.py and then used in ns2_trainer.py.

a897456, Mar 30 '24 05:03

I am also facing the same problem. As a temporary workaround, you can replace this line:

https://github.com/open-mmlab/Amphion/blob/5cb75d8d605ef12c90c64ba2e04919f4d5d834a1/models/tts/naturalspeech2/ns2_dataset.py#L121

with

with open(os.path.join(self.phone_dir, uid + ".phone"), "r") as f:
    self.utt2phone[utt] = f.read().strip()

while setting

self.phone_dir = os.path.join(processed_data_dir, 'phones')

in the __init__ of NS2Dataset

You can simply comment out the parts that use frame counts, because they are only used for dynamic batching. Also set "use_dynamic_batchsize": false in exp_config.json.
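The file-based lookup above can be sketched as a standalone helper. This is only an illustration of the workaround, not Amphion's code: the function name `load_phone_sequences`, the one-file-per-utterance layout `<phone_dir>/<uid>.phone`, and the use of a single `uid` as the dictionary key are assumptions.

```python
import os


def load_phone_sequences(phone_dir, uids):
    """Read one phone sequence per utterance from <phone_dir>/<uid>.phone.

    Hypothetical helper: the directory layout and the ".phone" suffix follow
    the workaround in this thread, not a documented Amphion convention.
    Returns (utt2phone, missing), where missing lists uids with no file.
    """
    utt2phone = {}
    missing = []
    for uid in uids:
        path = os.path.join(phone_dir, uid + ".phone")
        if not os.path.isfile(path):
            # Collect missing utterances instead of failing on the first one.
            missing.append(uid)
            continue
        with open(path, "r", encoding="utf-8") as f:
            utt2phone[uid] = f.read().strip()
    return utt2phone, missing
```

Returning the list of missing utterances makes it easier to see which samples still need a phone sequence before training starts.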

shreeshailgan, Apr 01 '24 10:04

Hi, you need to generate the phone sequence and record the number of frames for each sample.
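One way to record both values is to merge them into the metadata entries before training. The sketch below is hedged throughout: it assumes train.json is a JSON list of dictionaries, that each entry carries an utterance id under a key such as "Uid", and that the phone sequences and frame counts have already been computed elsewhere. The function names are hypothetical; only the field names `phones` and `num_frames` come from this issue.

```python
import json


def add_phones_and_num_frames(entries, phones_by_uid, num_frames_by_uid, uid_key="Uid"):
    """Return copies of metadata entries with phones and num_frames filled in.

    uid_key is an assumption about the metadata layout. Entries lacking
    either value are reported separately rather than written out incomplete.
    """
    complete, incomplete = [], []
    for entry in entries:
        uid = entry[uid_key]
        if uid in phones_by_uid and uid in num_frames_by_uid:
            updated = dict(entry)  # leave the input entry untouched
            updated["phones"] = phones_by_uid[uid]
            updated["num_frames"] = int(num_frames_by_uid[uid])
            complete.append(updated)
        else:
            incomplete.append(uid)
    return complete, incomplete


def write_augmented_metadata(in_path, out_path, phones_by_uid, num_frames_by_uid):
    """Load a metadata JSON list, add both fields, and write it to out_path."""
    with open(in_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    complete, incomplete = add_phones_and_num_frames(
        entries, phones_by_uid, num_frames_by_uid
    )
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(complete, f, indent=4, ensure_ascii=False)
    return incomplete
```

Writing to a separate output path keeps the original metadata intact in case some utterances turn out to be incomplete.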

HeCheng0625, Apr 02 '24 12:04

does number of frames mean the number of phones in the phone sequence?

shreeshailgan, Apr 02 '24 17:04

> does number of frames mean the number of phones in the phone sequence?

Hi @shreeshailgan, according to the NS2 paper: "As shown in Figure 2, our neural audio codec consists of an audio encoder, a residual vector-quantizer (RVQ), and an audio decoder: 1) The audio encoder consists of several convolutional blocks with a total downsampling rate of 200 for 16KHz audio, i.e., each frame corresponds to a 12.5ms speech segment." See https://arxiv.org/pdf/2304.09116.pdf for more details.
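Read this way, the frame count follows from the audio length rather than from the number of phones. A minimal sketch of the arithmetic, assuming that `num_frames` refers to these codec frames and that partial trailing frames are dropped (the actual padding convention may differ by one frame); both function names are hypothetical:

```python
def frame_duration_ms(hop_size=200, sample_rate=16000):
    """Duration of one codec frame in milliseconds."""
    return 1000.0 * hop_size / sample_rate


def num_frames_from_samples(num_samples, hop_size=200):
    """Number of whole codec frames in a waveform of num_samples samples.

    Floor division is an assumption; an encoder that pads the input
    may produce one extra frame.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return num_samples // hop_size
```

For example, with a downsampling rate of 200 at 16 kHz, `frame_duration_ms()` gives 12.5, and a 3-second clip (48000 samples) gives `num_frames_from_samples(48000) == 240`.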

HarryHe11, Apr 06 '24 03:04