nonebot_plugin_tts_gal

Problem with the generated speech

Open NAOLIU opened this issue 1 year ago • 11 comments

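No errors are reported during the process. I'm using a model I trained myself. I suspect the problem is config.json, because a few entries were missing and I added them myself. This is the version before my changes, where some keys could not be found: 修改前的.txt (before the edit). And this is the version after, with some keys added by hand: 修改后的.txt (after the edit).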

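Also, my model file name starts with G rather than D; I did of course rename it when I put it into the folder under nonebot. Generation succeeds, but the result is just a buzzing "zzz" sound.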

NAOLIU, May 16 '23 02:05
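For reference on the G/D prefixes: VITS training saves two checkpoints each time, G_<step>.pth for the generator and D_<step>.pth for the discriminator, and synthesis only needs the generator. A minimal sketch, assuming that naming convention, for picking the newest generator file before renaming it to whatever the plugin expects (the target file name is not given in this thread). The helper name latest_generator is hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Assumed VITS naming convention: G_<step>.pth = generator, D_<step>.pth = discriminator.
_GEN_PATTERN = re.compile(r"^G_(\d+)\.pth$")


def latest_generator(filenames: Iterable[str]) -> Optional[str]:
    """Return the generator checkpoint with the highest step number, or None if there is none."""
    best_name, best_step = None, -1
    for name in filenames:
        match = _GEN_PATTERN.match(Path(name).name)
        # Compare steps as integers, not strings.
        if match and int(match.group(1)) > best_step:
            best_name, best_step = name, int(match.group(1))
    return best_name


if __name__ == "__main__":
    files = ["D_20000.pth", "G_9000.pth", "G_20000.pth", "config.json"]
    print(latest_generator(files))  # G_20000.pth
```

Comparing step numbers as integers matters: sorted as strings, G_9000.pth would come after G_20000.pth.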

(image) I'm not sure how much this has to do with "symbols"; I don't really understand how to configure them.

NAOLIU, May 16 '23 02:05
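Since some keys were missing from the original config.json, a quick way to see what is absent is to walk the expected key paths. A minimal sketch; the list of required keys below is a hypothetical one based on a typical VITS text-to-speech config (like the one posted later in this thread), not the set the plugin actually reads, and the helper names are hypothetical too.

```python
import json
from typing import Any, Dict, List, Tuple

# HYPOTHETICAL list based on a typical VITS text-to-speech config.json;
# the exact keys nonebot_plugin_tts_gal reads are not stated in this thread.
ASSUMED_REQUIRED: List[Tuple[str, ...]] = [
    ("data", "text_cleaners"),
    ("data", "sampling_rate"),
    ("data", "filter_length"),
    ("data", "hop_length"),
    ("data", "add_blank"),
    ("data", "n_speakers"),
    ("model",),
    ("symbols",),
]


def load_config(path: str) -> Dict[str, Any]:
    """Read config.json explicitly as UTF-8 so non-ASCII content decodes the same on every OS."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_keys(config: Dict[str, Any]) -> List[str]:
    """Return the dotted paths of assumed-required keys that the config lacks."""
    missing = []
    for path in ASSUMED_REQUIRED:
        node: Any = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


if __name__ == "__main__":
    partial = {"data": {"text_cleaners": ["chinese_cleaners2"], "sampling_rate": 22050}}
    print(missing_keys(partial))
```

A key being present only means it exists; its value (especially "symbols") still has to match training.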

"symbols" is required and must match what was used when the model was trained. Find out which symbols your training used, then add them.

dpm12345, May 16 '23 02:05
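To make "must match" concrete: a VITS-style front end turns cleaned text into integer IDs by looking each character up in the symbols list, and the ID is simply that character's position in the list. A minimal sketch of the lookup (function names are hypothetical, not the plugin's), which also collects characters the table does not contain so a mismatch shows up instead of passing silently:

```python
from typing import Dict, List, Tuple


def build_symbol_table(symbols: List[str]) -> Dict[str, int]:
    """Map each symbol to its index; that index is what the model's embedding receives."""
    return {s: i for i, s in enumerate(symbols)}


def text_to_ids(cleaned: str, table: Dict[str, int]) -> Tuple[List[int], List[str]]:
    """Convert cleaned text to IDs and list the characters missing from the table."""
    ids, unknown = [], []
    for ch in cleaned:
        if ch in table:
            ids.append(table[ch])
        else:
            unknown.append(ch)
    return ids, unknown


if __name__ == "__main__":
    table = build_symbol_table(["_", ",", ".", "a", "n", "i", " "])
    print(text_to_ids("ni hao.", table))  # ([4, 5, 6, 3, 2], ['h', 'o'])
```

Because IDs are positions, reordering the list changes every ID even if the same characters are present.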

"symbols" is required and must match what was used when the model was trained. Find out which symbols your training used, then add them.

I'm using so-vits-svc 4.0. Where should I look for the corresponding symbols?

NAOLIU, May 16 '23 02:05

so-vits-svc probably won't work; this plugin is still plain VITS.

dpm12345, May 16 '23 02:05
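Why so-vits-svc models don't drop in: so-vits-svc is a voice-conversion system that takes audio as input, so its models have no text front end (no text cleaners, no symbol table), while this plugin feeds text-derived symbol IDs into a VITS text encoder. A rough heuristic, assuming a text-to-speech config always carries data.text_cleaners and a top-level symbols list as the config posted later in this thread does (the function name is hypothetical):

```python
from typing import Any, Dict


def has_text_frontend(config: Dict[str, Any]) -> bool:
    """Heuristic: a VITS text-to-speech config names text cleaners and lists symbols."""
    data = config.get("data", {})
    return bool(data.get("text_cleaners")) and bool(config.get("symbols"))


if __name__ == "__main__":
    tts_like = {"data": {"text_cleaners": ["chinese_cleaners2"]}, "symbols": ["_", "a"]}
    vc_like = {"data": {"sampling_rate": 44100}}
    print(has_text_frontend(tts_like), has_text_frontend(vc_like))
```

Passing this check is necessary, not sufficient: hand-adding these keys to a config does not add a trained text encoder to the checkpoint.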

so-vits-svc probably won't work; this plugin is still plain VITS.

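Oh, I see! I assumed that because the model files have the same extension they were interchangeable. Are their models actually different?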

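Will so-vits-svc be supported in the future? I've seen many people online using so-vits-svc for text-to-speech. In principle it should have plenty of users, and there are lots of models for it.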

Or can a so-vits-svc model be converted into a VITS model?

NAOLIU, May 16 '23 02:05

Sorry, I haven't looked into that yet, so I can't answer for now.

dpm12345, May 16 '23 02:05

so-vits-svc probably won't work; this plugin is still plain VITS.

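I tried a VITS model instead, but for some reason, no matter how long the input sentence is, the bot only outputs one second of audio, and it's just a single sound. No errors occur during generation. Here is my model's config.json: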

{
  "train": {
    "log_interval": 10,
    "eval_interval": 100,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 0.0002,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": true,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0
  },
  "data": {
    "training_files": "final_annotation_train.txt",
    "validation_files": "final_annotation_val.txt",
    "text_cleaners": [
      "chinese_cleaners2"
    ],
    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 3,
    "cleaned_text": true
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      4,
      4
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "speakers": [
    "Qingzi",
    "specialweek",
    "zhongli"
  ],
  "symbols": [
    "_",
    ",",
    ".",
    "!",
    "?",
    "-",
    "~",
    "\u2026",
    "A",
    "E",
    "I",
    "N",
    "O",
    "Q",
    "U",
    "a",
    "b",
    "d",
    "e",
    "f",
    "g",
    "h",
    "i",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "p",
    "r",
    "s",
    "t",
    "u",
    "v",
    "w",
    "y",
    "z",
    "\u0283",
    "\u02a7",
    "\u02a6",
    "\u026f",
    "\u0279",
    "\u0259",
    "\u0265",
    "\u207c",
    "\u02b0",
    "`",
    "\u2192",
    "\u2193",
    "\u2191",
    " "
  ]
}

The only thing I changed is "text_cleaners": when I got this model, its JSON had "zh_ja_mixture_cleaners", and I changed it to "chinese_cleaners2". Could that be the cause? (Nothing else was touched.)

NAOLIU, May 16 '23 10:05
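Two settings in the config above decide how long the audio is. With "add_blank": true, VITS inserts a blank ID 0 before, between and after the symbol IDs; and each output frame covers hop_length samples (256 here, which equals the product of upsample_rates, 8*8*2*2), so a clip lasts frames * hop_length / sampling_rate seconds. A minimal sketch of both (helper names are hypothetical). One possible, unconfirmed reading of the one-second output is that very few IDs reach the model, so it predicts very few frames.

```python
from typing import List


def intersperse(ids: List[int], blank: int = 0) -> List[int]:
    """Put `blank` before, between and after the IDs, as "add_blank": true does in VITS."""
    result = [blank] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result


def frames_to_seconds(frames: int, hop_length: int, sampling_rate: int) -> float:
    """Each output frame covers hop_length samples at sampling_rate Hz."""
    return frames * hop_length / sampling_rate


if __name__ == "__main__":
    print(intersperse([4, 5, 6]))  # [0, 4, 0, 5, 0, 6, 0]
    # With hop_length 256 at 22050 Hz, about 86 frames make one second.
    print(round(frames_to_seconds(86, 256, 22050), 3))  # 0.998
```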

The only thing I changed is "text_cleaners": when I got this model, its JSON had "zh_ja_mixture_cleaners", and I changed it to "chinese_cleaners2". Could that be the cause? (Nothing else was touched.)

The plugin currently has no zh_ja_mixture_cleaners option. The two cleaners process text differently, which is why it went wrong.

dpm12345, May 16 '23 10:05
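The cleaner named in data.text_cleaners is looked up by name in code, and the model was trained on text produced by that exact cleaner, so editing the name in config.json swaps the preprocessing without changing the model. A minimal sketch of such a name-based registry with a toy cleaner; the registry, decorator and cleaner are all hypothetical, not the plugin's implementation. In a design like this, supporting zh_ja_mixture_cleaners would mean adding and registering an implementation in code.

```python
from typing import Callable, Dict, List

Cleaner = Callable[[str], str]

# HYPOTHETICAL registry for illustration; not the plugin's real code.
CLEANERS: Dict[str, Cleaner] = {}


def register(name: str) -> Callable[[Cleaner], Cleaner]:
    """Decorator that makes a cleaner selectable by name from config.json."""
    def decorator(fn: Cleaner) -> Cleaner:
        CLEANERS[name] = fn
        return fn
    return decorator


@register("toy_lowercase_cleaners")  # hypothetical toy cleaner
def toy_lowercase_cleaners(text: str) -> str:
    return text.lower().strip()


def clean(text: str, names: List[str]) -> str:
    """Apply the cleaners listed in data.text_cleaners, failing loudly on unknown names."""
    for name in names:
        if name not in CLEANERS:
            raise KeyError(f"unknown cleaner {name!r}; available: {sorted(CLEANERS)}")
        text = CLEANERS[name](text)
    return text


if __name__ == "__main__":
    print(clean("  Hello ", ["toy_lowercase_cleaners"]))
```

Raising on an unknown name is a deliberate choice: it surfaces the mismatch instead of silently producing the wrong symbols.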

The plugin currently has no zh_ja_mixture_cleaners option. The two cleaners process text differently, which is why it went wrong.

I see. But how would I convert zh_ja_mixture_cleaners to chinese_cleaners2? Or does that mean this model simply can't be used?

NAOLIU, May 16 '23 11:05

Without modifying the code, it can't be used.

dpm12345, May 16 '23 11:05

Without modifying the code, it can't be used.

I generated another one that uses "chinese_cleaners" (without the 2). Changing it to "chinese_cleaners2" in the JSON should work too, right?

NAOLIU, May 18 '23 06:05
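Whether relabeling works can't be decided from the cleaner's name alone. One concrete check, assuming you can obtain the symbol list the model was trained with, is to compare it index by index with the list in config.json, since each ID selects a row of the model's symbol embedding and any shift changes what the model receives. A minimal sketch (the function name is hypothetical):

```python
from itertools import zip_longest
from typing import List, Tuple


def id_mismatches(trained: List[str], configured: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, trained_symbol, configured_symbol) wherever the two tables differ.

    A missing entry on either side is shown as an empty string.
    """
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(trained, configured, fillvalue=""))
        if a != b
    ]


if __name__ == "__main__":
    print(id_mismatches(["_", "a", "b"], ["_", "b", "a", "c"]))
```

Identical tables are necessary but not sufficient: the cleaner must also produce the same kind of text the model saw during training.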