Request for Higher Audio Quality Configuration Support

Open LeXwDeX opened this issue 4 months ago • 6 comments

On Windows 11, when using Narrator with 24kHz audio output, we’ve observed occasional distortion or popping artifacts in specific scenarios. To improve playback quality, would it be possible to introduce a configuration option that enables Narrator to output audio at 48kHz?

This enhancement would help ensure clearer and more stable audio performance, especially in environments sensitive to sound fidelity.

LeXwDeX avatar Aug 07 '25 09:08 LeXwDeX

If you are using offline natural voices, then unfortunately the only supported format is 24 kHz. This limitation is documented for embedded speech.

If you are using online voices, then it might be possible to use a higher-quality format. I chose the 24kHz 48kbit/s mono MP3 format just because it's convenient and supported by Edge and Azure voices. Edge voices have more limitations, so not all formats are supported.

I wonder, how much quality improvement will it be from 24kHz to 48kHz? The current 24kHz format sounds decent to me. Could you give me some examples of "occasional distortion or popping artifacts" that are easy to reproduce?
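
For a rough sense of what the two rates mean on paper: by the sampling theorem, audio sampled at 24 kHz can represent frequencies up to 12 kHz, while 48 kHz reaches 24 kHz. The sketch below only does that arithmetic, plus the data rate of the decoded 16-bit mono PCM. The helper names are made up for illustration and are not part of NaturalVoiceSAPIAdapter, and the numbers say nothing about whether a given voice actually carries content above 12 kHz.

```cpp
#include <cassert>
#include <cstdint>

// Highest frequency (Hz) that a stream sampled at sampleRateHz can represent.
inline std::uint32_t NyquistHz(std::uint32_t sampleRateHz)
{
    return sampleRateHz / 2;
}

// Data rate of uncompressed PCM in bytes per second.
inline std::uint32_t PcmBytesPerSecond(std::uint32_t sampleRateHz,
                                       std::uint32_t bitsPerSample,
                                       std::uint32_t channels)
{
    assert(bitsPerSample % 8 == 0); // whole bytes per sample only
    return sampleRateHz * channels * (bitsPerSample / 8);
}
```

With these helpers, `NyquistHz(24000)` is 12000 and `NyquistHz(48000)` is 24000; 16-bit mono PCM comes to 48000 bytes/s at 24 kHz and 96000 bytes/s at 48 kHz.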

gexgd0419 avatar Aug 07 '25 11:08 gexgd0419

The difference in audio quality is noticeable even on regular headphones. I checked the code and found that with some minor adjustments to the decoder and request parameters, support for 48kHz output is possible (though I’m not sure if this will actually be utilized by Windows Narrator). I also compiled my own install.exe and x64 DLL to register for testing.

My main use case is enabling TTS in the game World of Warcraft, where I’ve noticed some strange popping noises. Subjectively, after making the modifications mentioned above, the popping sounds have been reduced.

LeXwDeX avatar Aug 08 '25 01:08 LeXwDeX

Are you using online voices?

Could you tell me the voice and the new parameters you are using, so that I can experiment and implement it myself?

gexgd0419 avatar Aug 08 '25 01:08 gexgd0419

Azure Online TTS Server:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-text-to-speech?tabs=streaming#audio-outputs

NaturalVoiceSAPIAdapter\SpeechRestAPI.cpp

// Send configuration and wait for audio data response
void SpeechRestAPI::SendRequest(const WSConnectionPtr& conn)
{
	m_allDataReceived = false;

	nlohmann::json json = {
		{"context", {
			{"synthesis", {
				{"audio", {
					{"metadataOptions", {
						{"bookmarkEnabled", (bool)BookmarkCallback},
						{"punctuationBoundaryEnabled", (bool)PunctuationBoundaryCallback},
						{"sentenceBoundaryEnabled", (bool)SentenceBoundaryCallback},
						{"wordBoundaryEnabled", (bool)WordBoundaryCallback},
						{"visemeEnabled", (bool)VisemeCallback},
					}},
					// Request 48 kHz, 192 kbit/s mono MP3 instead of the 24 kHz, 48 kbit/s MP3 format
					{"outputFormat", "audio-48khz-192kbitrate-mono-mp3"}
				}},
				{"language", {
					{"autoDetection", false}
				}}
			}}
		}}
	};

NaturalVoiceSAPIAdapter\TTSEngine.cpp

STDMETHODIMP CTTSEngine::GetOutputFormat(const GUID* /*pTargetFormatId*/, const WAVEFORMATEX* /*pTargetWaveFormatEx*/,
    GUID* pDesiredFormatId, WAVEFORMATEX** ppCoMemDesiredWaveFormatEx) noexcept
{
    // Report 48 kHz 16-bit mono PCM to SAPI, matching the 48 kHz format requested from Azure
    return SpConvertStreamFormatEnum(SPSF_48kHz16BitMono, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);
}
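
The two changes above presumably need to agree: the sample rate requested from the service and the PCM format reported to SAPI. A minimal sketch of keeping both in one table follows. `PcmFormat`, `OutputProfile`, and `ProfileForSampleRate` are hypothetical names, not code from NaturalVoiceSAPIAdapter, and `PcmFormat` is a stand-in that mirrors a few `WAVEFORMATEX` fields so the example compiles without Windows headers. The format strings come from the Azure audio-outputs list linked above; whether a given voice (Edge voices in particular) accepts each one is not verified here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for the WAVEFORMATEX fields that matter here (assumption: PCM only).
struct PcmFormat
{
    std::uint32_t samplesPerSec;
    std::uint16_t bitsPerSample;
    std::uint16_t channels;

    // Bytes per sample frame, like WAVEFORMATEX::nBlockAlign.
    std::uint16_t BlockAlign() const
    {
        assert(bitsPerSample % 8 == 0);
        return static_cast<std::uint16_t>(channels * bitsPerSample / 8);
    }

    // Bytes per second, like WAVEFORMATEX::nAvgBytesPerSec.
    std::uint32_t AvgBytesPerSec() const
    {
        return samplesPerSec * BlockAlign();
    }
};

// One entry ties the request parameter to the PCM format produced after decoding.
struct OutputProfile
{
    std::string azureOutputFormat;
    PcmFormat pcm;
};

// Returns the profile for a sample rate, or nullptr if the rate is not in the table.
inline const OutputProfile* ProfileForSampleRate(std::uint32_t sampleRateHz)
{
    static const OutputProfile profiles[] = {
        {"audio-24khz-48kbitrate-mono-mp3", {24000, 16, 1}},
        {"audio-48khz-192kbitrate-mono-mp3", {48000, 16, 1}},
    };
    for (const OutputProfile& profile : profiles)
    {
        if (profile.pcm.samplesPerSec == sampleRateHz)
            return &profile;
    }
    return nullptr;
}
```

A configuration option could then select the profile once, with `SendRequest` reading `azureOutputFormat` and `GetOutputFormat` choosing the matching SAPI format, so the two cannot drift apart.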

LeXwDeX avatar Aug 08 '25 06:08 LeXwDeX

Since you are using this in World of Warcraft, is #37 a problem for you? Someone reported that World of Warcraft 11.2 is not working with the latest version of this engine.

gexgd0419 avatar Aug 08 '25 09:08 gexgd0419

I used the Lua API in World of Warcraft and didn’t encounter this issue.

I first came across this popping sound issue quite a while ago. As a test, I think I’ll try switching from the streaming REST API to a single request that returns the complete file.
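
One way to approximate that test without changing the transport would be to keep receiving chunks as before, hold them back until the response is complete, and only then decode the MP3 in one pass. A minimal sketch of that buffering step follows. `AudioCollector` and its methods are hypothetical and do not exist in NaturalVoiceSAPIAdapter; the cost of this approach is extra latency before the first sound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulates encoded audio chunks and releases them only as one complete payload.
class AudioCollector
{
public:
    // Called for every chunk received from the service.
    void OnChunk(const std::uint8_t* data, std::size_t size)
    {
        assert(!m_complete); // no data expected after the end-of-response signal
        if (size != 0)
            m_buffer.insert(m_buffer.end(), data, data + size);
    }

    // Called once the service signals that all audio has been sent.
    void OnComplete() { m_complete = true; }

    bool IsComplete() const { return m_complete; }

    // Hands over the whole payload and resets the collector for the next request.
    std::vector<std::uint8_t> TakeAll()
    {
        assert(m_complete); // decoding a partial file would defeat the test
        std::vector<std::uint8_t> all;
        all.swap(m_buffer);
        m_complete = false;
        return all;
    }

private:
    std::vector<std::uint8_t> m_buffer;
    bool m_complete = false;
};
```

If the popping disappears when the decoder only ever sees the complete payload, that would point at chunk handling rather than the sample rate.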

LeXwDeX avatar Aug 09 '25 06:08 LeXwDeX