
A question about phoneme conversion

Open AFun9 opened this issue 3 weeks ago • 1 comment

Hi everyone. I had been wondering for a while whether I could build a mixed-language TTS, but I eventually realized that was too hard for me. Instead, I collected datasets in two languages, English and Russian, both recorded by the same speaker so that the timbre stays consistent. By adapting the ljspeech recipe in icefall, I trained one TTS model per language. My plan was to split the text by language in a preprocessing step, send each segment to the corresponding TTS model, and then concatenate the generated audio. In practice, even though both models were trained on the same speaker's data, their output voices still differ noticeably. There is a second problem that is hard to solve. Take "Здравствуйте hello" (both words mean "hello") as an example: after concatenation the audio is not continuous and simply sounds wrong, and above all the timbre differs between the two parts, despite the shared speaker.

Later I came up with a way to build a "budget" mixed-language TTS. Fundamentally, the model operates on phonemes, so in theory it can synthesize any language, because words in every language can be converted to phonemes. The only problem is that different languages use different phoneme inventories. Take English and Russian: English has over a hundred tokens, while Russian has roughly half that many (these are just the tokens produced during my training; all phonemes were generated with piper_phonemize). To let the Russian TTS pronounce English words, my approach is simple: still split the text during preprocessing, route Russian segments through lang=ru and English segments through lang=en-us, but add one extra step after the English segments are phonemized that maps the English phonemes onto Russian phonemes. This solves the problem of the Russian model not knowing the English phonemes. The remaining issue, which I consider minor, is that the English comes out with a Russian accent, because some English sounds do not exist in Russian, a bit like a foreigner speaking Chinese. I had an AI build the mapping table for me, and it seems workable; with a sufficiently good mapping table this should be entirely feasible. (For some reason I cannot upload my audio samples.)
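Here is a minimal sketch of the idea. The mapping table is abbreviated and purely illustrative (mine was AI-generated and much larger), and it assumes the piper_phonemize Python binding exposes phonemize_espeak(text, voice) returning a list of sentences, each a list of phoneme strings:

```python
import re

# Assumption: piper_phonemize provides phonemize_espeak(text, voice).
from piper_phonemize import phonemize_espeak

# Illustrative, heavily abbreviated mapping of English-only phonemes to the
# closest Russian ones. A real table must cover every phoneme en-us can emit.
EN_TO_RU_PHONEME = {
    "æ": "a",   # no /æ/ in Russian, approximate with /a/
    "ð": "z",   # no dental fricatives
    "θ": "s",
    "w": "v",
    "ɹ": "r",
}


def split_by_script(text):
    """Split text into (voice, chunk) pairs based on Cyrillic vs. Latin letters."""
    chunks = []
    for m in re.finditer(r"[А-Яа-яЁё]+|[A-Za-z]+", text):
        voice = "ru" if re.match(r"[А-Яа-яЁё]", m.group()) else "en-us"
        chunks.append((voice, m.group()))
    return chunks


def to_russian_phonemes(text):
    """Phonemize mixed ru/en text so that only Russian-model phonemes remain."""
    phonemes = []
    for voice, chunk in split_by_script(text):
        for sentence in phonemize_espeak(chunk, voice):
            if voice == "en-us":
                # Remap English phonemes the Russian model has never seen.
                sentence = [EN_TO_RU_PHONEME.get(p, p) for p in sentence]
            phonemes.extend(sentence)
    return phonemes


print(to_russian_phonemes("Здравствуйте hello"))
```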

My question: in both the Python API and the Java API mentioned in your documentation, no lang parameter is passed when generating audio. So I went into the source code and found that it is passed implicitly. When the model is exported, a meta_data block, something like a small manifest, is attached to it:

```python
meta_data = {
    "model_type": "vits",
    "version": "1",
    "model_author": "k2-fsa",
    "comment": "icefall",  # must be icefall for models from icefall
    "language": "English",
    "voice": "en-us",  # Choose your language appropriately
    "has_espeak": 1,
    "n_speakers": num_speakers,
    "sample_rate": model.model.sampling_rate,  # Must match the real sample rate
}
```

Your source code passes the language through meta_data.voice; it is not exposed to the user:

```cpp
GeneratedAudio Generate(
    const std::string &_text, int64_t sid = 0, float speed = 1.0,
    GeneratedAudioCallback callback = nullptr) const override {
  const auto &meta_data = model_->GetMetaData();
  int32_t num_speakers = meta_data.num_speakers;

  if (num_speakers == 0 && sid != 0) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%{public}d. sid is ignored",
        static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This is a single-speaker model and supports only sid 0. Given sid: "
        "%d. sid is ignored",
        static_cast<int32_t>(sid));
#endif
  }

  if (num_speakers != 0 && (sid >= num_speakers || sid < 0)) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "This model contains only %{public}d speakers. sid should be in the "
        "range [%{public}d, %{public}d]. Given: %{public}d. Use sid=0",
        num_speakers, 0, num_speakers - 1, static_cast<int32_t>(sid));
#else
    SHERPA_ONNX_LOGE(
        "This model contains only %d speakers. sid should be in the range "
        "[%d, %d]. Given: %d. Use sid=0",
        num_speakers, 0, num_speakers - 1, static_cast<int32_t>(sid));
#endif
    sid = 0;
  }

  std::string text = _text;
  if (config_.model.debug) {
#if __OHOS__
    SHERPA_ONNX_LOGE("Raw text: %{public}s", text.c_str());
#else
    SHERPA_ONNX_LOGE("Raw text: %s", text.c_str());
#endif
  }

  if (!tn_list_.empty()) {
    for (const auto &tn : tn_list_) {
      text = tn->Normalize(text);
      if (config_.model.debug) {
#if __OHOS__
        SHERPA_ONNX_LOGE("After normalizing: %{public}s", text.c_str());
#else
        SHERPA_ONNX_LOGE("After normalizing: %s", text.c_str());
#endif
      }
    }
  }

  std::vector<TokenIDs> token_ids =
      frontend_->ConvertTextToTokenIds(text, meta_data.voice);

  if (token_ids.empty() ||
      (token_ids.size() == 1 && token_ids[0].tokens.empty())) {
    SHERPA_ONNX_LOGE("Failed to convert %s to token IDs", text.c_str());
    return {};
  }

  std::vector<std::vector<int64_t>> x;
  std::vector<std::vector<int64_t>> tones;

  x.reserve(token_ids.size());

  for (auto &i : token_ids) {
    x.push_back(std::move(i.tokens));
  }

  if (!token_ids[0].tones.empty()) {
    tones.reserve(token_ids.size());
    for (auto &i : token_ids) {
      tones.push_back(std::move(i.tones));
    }
  }

  // TODO(fangjun): add blank inside the frontend, not here
  if (meta_data.add_blank && config_.model.vits.data_dir.empty() &&
      meta_data.frontend != "characters") {
    for (auto &k : x) {
      k = AddBlank(k);
    }

    for (auto &k : tones) {
      k = AddBlank(k);
    }
  }

  int32_t x_size = static_cast<int32_t>(x.size());

  if (config_.max_num_sentences <= 0 || x_size <= config_.max_num_sentences) {
    auto ans = Process(x, tones, sid, speed);
    if (callback) {
      callback(ans.samples.data(), ans.samples.size(), 1.0);
    }
    return ans;
  }

  // the input text is too long, we process sentences within it in batches
  // to avoid OOM. Batch size is config_.max_num_sentences
  std::vector<std::vector<int64_t>> batch_x;
  std::vector<std::vector<int64_t>> batch_tones;

  int32_t batch_size = config_.max_num_sentences;
  batch_x.reserve(config_.max_num_sentences);
  batch_tones.reserve(config_.max_num_sentences);
  int32_t num_batches = x_size / batch_size;

  if (config_.model.debug) {
#if __OHOS__
    SHERPA_ONNX_LOGE(
        "Text is too long. Split it into %{public}d batches. batch size: "
        "%{public}d. Number of sentences: %{public}d",
        num_batches, batch_size, x_size);
#else
    SHERPA_ONNX_LOGE(
        "Text is too long. Split it into %d batches. batch size: %d. Number "
        "of sentences: %d",
        num_batches, batch_size, x_size);
#endif
  }

  GeneratedAudio ans;

  int32_t should_continue = 1;

  int32_t k = 0;

  for (int32_t b = 0; b != num_batches && should_continue; ++b) {
    batch_x.clear();
    batch_tones.clear();
    for (int32_t i = 0; i != batch_size; ++i, ++k) {
      batch_x.push_back(std::move(x[k]));

      if (!tones.empty()) {
        batch_tones.push_back(std::move(tones[k]));
      }
    }

    auto audio = Process(batch_x, batch_tones, sid, speed);
    ans.sample_rate = audio.sample_rate;
    ans.samples.insert(ans.samples.end(), audio.samples.begin(),
                       audio.samples.end());
    if (callback) {
      should_continue = callback(audio.samples.data(), audio.samples.size(),
                                 (b + 1) * 1.0 / num_batches);
      // Caution(fangjun): audio is freed when the callback returns, so users
      // should copy the data if they want to access the data after
      // the callback returns to avoid segmentation fault.
    }
  }

  batch_x.clear();
  batch_tones.clear();
  while (k < static_cast<int32_t>(x.size()) && should_continue) {
    batch_x.push_back(std::move(x[k]));
    if (!tones.empty()) {
      batch_tones.push_back(std::move(tones[k]));
    }

    ++k;
  }

  if (!batch_x.empty()) {
    auto audio = Process(batch_x, batch_tones, sid, speed);
    ans.sample_rate = audio.sample_rate;
    ans.samples.insert(ans.samples.end(), audio.samples.begin(),
                       audio.samples.end());
    if (callback) {
      callback(audio.samples.data(), audio.samples.size(), 1.0);
      // Caution(fangjun): audio is freed when the callback returns, so users
      // should copy the data if they want to access the data after
      // the callback returns to avoid segmentation fault.
    }
  }

  return ans;
}
```

This is part of your source code. It looks like the whole generation pipeline is wrapped into a single interface, from text processing all the way to audio generation, and piper_phonemize is a pure Python library that has not been ported to Java yet. So, would it be possible to rework this interface? I can only read C++, I'm not really able to write it myself, sorry.
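For reference, the voice field that the C++ code reads is just ONNX model metadata, so it can at least be inspected (or rewritten) offline without retraining. A minimal sketch using the onnx Python package, assuming the exported file is named model.onnx (the filename and the new value are only placeholders):

```python
import onnx

model = onnx.load("model.onnx")

# Print all metadata entries baked in at export time
# (model_type, voice, language, sample_rate, ...).
for prop in model.metadata_props:
    print(prop.key, "=", prop.value)

# Rewriting an entry, e.g. the voice used by the frontend, is also possible.
for prop in model.metadata_props:
    if prop.key == "voice":
        prop.value = "ru"  # illustrative only; must match a phonemizer voice

onnx.save(model, "model-ru-voice.onnx")
```

Of course this only changes the default for the whole model, not per call, which is why I am asking whether the interface itself could expose it.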

AFun9 · Nov 01 '25 08:11