With Microsoft Edge online voices, NVDA's continuous reading stops after the first sentence
Greetings, and thanks for your awesome efforts. The problem is that NVDA can't use Microsoft Edge online voices in continuous reading. When continuous reading is started with NVDA+Down, the first sentence is read, but reading stops after that. So, in effect, continuous reading can't be used in NVDA with Microsoft Edge online voices. I've tested NVDA 2024.2 Release Candidate and NVDA 2024.3 alphas. The offline natural voices aren't affected. Can something be done to take care of this issue?
Hello,
I have the same problem and wanted to open it on GitHub, so @amirsol81 thanks for opening it.
@gexgd0419 Thank you for allowing us to use natural TTS voices in screen readers and other applications. Do you have any progress or answer on this issue?
NVDA inserts bookmarks into the text to be spoken. When a bookmark is reached during speaking, the TTS engine/voice will tell NVDA the name of the bookmark, so NVDA will know the current speaking progress.
This engine does support bookmarks. It can pass the bookmarks to the natural voices and notify the TTS client (NVDA) when a bookmark is reached, if this is supported by the natural voice you are using.
Local Narrator natural voices and Azure natural voices support bookmarks and other features. However, Microsoft Edge voices only support a very limited subset of features, such as volume/rate/pitch adjustment. Bookmarks, unfortunately, are not supported by Edge voices.
Since the Edge voice server will just close the connection immediately if there's any unsupported SSML tag, this engine will remove all unsupported tags before sending the SSML to the server. As a result, you can still hear the text being spoken, but all unsupported elements will be lost. (#2 is another example)
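To make the tag removal described above concrete, here is a minimal Python sketch, not the engine's actual code. The allowlist (`speak`, `voice`, `prosody`) and the names `render` and `filter_ssml` are assumptions made for illustration, and namespaces are left out for brevity. Any tag outside the allowlist is dropped, while the text inside and after it is kept.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed allowlist, for illustration only; not the engine's real list.
ALLOWED_TAGS = {"speak", "voice", "prosody"}


def render(elem):
    """Serialize elem, dropping disallowed tags but keeping their text."""
    inner = escape(elem.text or "")
    for child in elem:
        # child.tail is the text that follows the child's closing tag.
        inner += render(child) + escape(child.tail or "")
    if elem.tag not in ALLOWED_TAGS:
        return inner  # tag removed, content preserved
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    return f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"


def filter_ssml(ssml):
    """Parse an SSML string and return it with unsupported tags removed."""
    return render(ET.fromstring(ssml))


if __name__ == "__main__":
    sample = ('<speak><prosody rate="+20%">Hello <emphasis>big</emphasis> '
              'world<bookmark mark="1"/>!</prosody></speak>')
    print(filter_ssml(sample))
```

In this sketch the text is still spoken, but whatever the removed elements were meant to do (emphasis, bookmarks) is lost, which matches the behaviour described above.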
Seemingly NVDA relies on bookmarks to know when a part is completed so that it can continue speaking the next part. When using Edge online voices, text can still be spoken, but bookmarks aren't supported, so NVDA will never know when the current part is completed.
There is a possible solution, though. Edge voices don't support bookmarks, but they do support word boundary events, which tell the client when each word is being spoken. Unfortunately NVDA doesn't use word boundary events, but maybe my engine could be made to simulate bookmark events, based on the supported word boundary events.
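One way the idea could work, as a rough Python sketch under stated assumptions rather than the engine's real implementation: remember the character offset of each bookmark in the text that is actually sent, then fire a bookmark as soon as a word boundary event reports a word starting at or after that offset. The assumption that each word boundary event carries a character offset, and the names `extract_bookmarks` and `BookmarkSimulator`, are hypothetical.

```python
import re

BOOKMARK_RE = re.compile(r"<bookmark\s+mark=['\"]([^'\"]*)['\"]\s*/>")


def extract_bookmarks(text):
    """Return (plain_text, [(offset, name), ...]) with bookmark tags removed."""
    parts, marks = [], []
    pos = length = 0
    for m in BOOKMARK_RE.finditer(text):
        chunk = text[pos:m.start()]
        parts.append(chunk)
        length += len(chunk)
        marks.append((length, m.group(1)))  # offset in the stripped text
        pos = m.end()
    parts.append(text[pos:])
    return "".join(parts), marks


class BookmarkSimulator:
    """Turns word boundary events into synthetic bookmark events."""

    def __init__(self, marks, on_bookmark):
        self._pending = list(marks)  # already ordered by offset
        self._on_bookmark = on_bookmark

    def on_word_boundary(self, offset):
        # Assumption: offset is the word's start position in the sent text.
        while self._pending and self._pending[0][0] <= offset:
            self._on_bookmark(self._pending.pop(0)[1])

    def on_speech_end(self):
        # Bookmarks after the last word can only fire once speech ends.
        while self._pending:
            self._on_bookmark(self._pending.pop(0)[1])


if __name__ == "__main__":
    plain, marks = extract_bookmarks(
        "First sentence.<bookmark mark='1'/> Second one.<bookmark mark='2'/>")
    sim = BookmarkSimulator(marks, lambda name: print("bookmark", name))
    for word_offset in (0, 6, 16, 23):  # starts of the four words
        sim.on_word_boundary(word_offset)
    sim.on_speech_end()
```

In this sketch a bookmark fires slightly late, when the following word begins, and a trailing bookmark fires when speech ends. That should be enough for a client that only needs to know a part has finished.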
maybe my engine could be made to simulate bookmark events, based on the supported word boundary events
This works! With simulated bookmark events, NVDA continuous reading can work correctly. Online voices can be slow, though.
The fix will be in the next release version.
@gexgd0419 Mega thanks for your efforts, and really looking forward to the next release. So can this fix also help other related issues? For instance, if we use Edge online voices, pressing Alt+Tab or Windows+M to reach the Desktop makes NVDA just say "Desktop". NVDA doesn't read the focused item on the Desktop. I guess this is yet another bookmark-related issue with NVDA.
I can confirm the same behaviour with NVDA when it is on the desktop, and also the same with continuous reading. Btw, you said that it might slow down the whole reading process when online voices are being used. Is there any way to measure how slow that is, and could it be sped up somehow?
@gexgd0419 It's worth mentioning that the Kurzweil 1000, a relatively old app by today's standards, interacts very well with Microsoft online voices, and they can be used for continuous reading in the K1000 flawlessly. Quite interestingly, the K1000 can even generate high-quality MP3 files with those voices.
@gexgd0419 Any idea as to when we'll get the engine update? I'd very much like to use the update with NVDA for continuous reading.
@gexgd0419 Greetings. I'm running Windows 11 24H2 now, and the SAPI Adapter engine works flawlessly, just as it did with Win 11 23H2. Still looking forward to the next release of the engine for NVDA-specific fixes.
A new version v0.2 has been released! This version should fix the issue.
@gexgd0419 Thanks! I can confirm that NVDA is now properly supported - it's awesome! However, I've seen 2 issues.
- With MS Edge online voices, longer pauses are placed between sentences in NVDA in continuous reading. It's a bit strange because I get no pauses in, say, the Kurzweil 1000 in continuous reading. What do you think might generate these relatively longer pauses in NVDA?
- Quite interestingly, V0.2 seems to have broken JAWS support for MS Edge online voices. With V0.2, attempting to read via JAWS - either continuously or line by line - will result in many words being glued to one another, producing nonsensical pronunciations. V0.1 had no issues in this regard. I'm testing with JAWS 2024 on my work machine/ Windows 11 24H2, and will perform more tests at home.
Thanks again for your efforts.
@gexgd0419 I did more tests with JAWS on my home machine, which is running V0.1 of the engine, not V0.2. I've updated my home machine to Windows 11 24H2, and here JAWS has issues with the SAPI engine regardless of its version. So it seems to be related to Windows 11 24H2, not your SAPI engine, as JAWS was working properly with V0.1 of the engine and Win 11 23H2 before the upgrade. Since I'm not a JAWS user, I hadn't noticed the regression with the Win 11 upgrade.
With MS Edge online voices, longer pauses are placed between sentences in NVDA in continuous reading.
NVDA breaks the text to be spoken into sentences, and sends the sentences to the TTS engine one at a time. So the TTS engine knows nothing about the next sentence until the current sentence is finished.
This is fine for a local TTS voice with little delay. But for online voices, establishing a network connection to the server, sending requests to the server, and receiving data from the server all take some time. On my Internet connection, this usually adds a delay of about 200 ms between sentences. So I guess that's where the long pauses come from.
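As a back-of-the-envelope illustration of how that per-sentence delay adds up, here is a small Python sketch. The function name `added_pause_seconds` is hypothetical, and the 200 ms default is just the rough figure quoted above for one connection, not a measured constant.

```python
def added_pause_seconds(sentence_count, per_sentence_delay_ms=200):
    """Estimate total extra silence when sentences are sent one at a time.

    Assumes one network round trip between each pair of consecutive
    sentences, so there are sentence_count - 1 gaps.
    """
    gaps = max(sentence_count - 1, 0)
    return gaps * per_sentence_delay_ms / 1000


if __name__ == "__main__":
    # An article of 51 sentences has 50 gaps between sentences.
    print(added_pause_seconds(51))
```

The real delay depends on the network and the server, so this is only meant to show the scale of the effect.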
attempting to read via JAWS - either continuously or line by line - will result in many words being glued to one another
I just checked the SSML generated by this engine when using JAWS.
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
To<bookmark mark='5'/>navigate<bookmark mark='6'/>press<bookmark mark='7'/>Up<bookmark mark='8'/>or<bookmark mark='9'/>Down<bookmark mark='10'/>Arrow.
</speak>
Then I realized that <bookmark> tags can not only mark positions, but also separate words.
Edge voices don't support bookmarks, so bookmark tags in SSML are removed, and the bookmark events are synthesized by analyzing the word boundary events. However, now the SSML becomes:
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-US'>
TonavigatepressUporDownArrow.
</speak>
without any space between the words.
So I think that this can be fixed by adding spaces in the places where the bookmarks should be. The fix will be in the next release.
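A minimal Python sketch of that idea, assuming a simple regular expression is enough to find the tags (the pattern and the name `replace_bookmarks_with_spaces` are illustrative, not the engine's code): each `<bookmark>` tag is replaced by a single space instead of being deleted outright.

```python
import re

# Matches self-closing bookmark tags such as <bookmark mark='5'/>.
BOOKMARK_TAG = re.compile(r"<bookmark\b[^>]*/>")


def replace_bookmarks_with_spaces(ssml_body):
    """Swap every bookmark tag for a space so adjacent words stay separate."""
    return BOOKMARK_TAG.sub(" ", ssml_body)


if __name__ == "__main__":
    jaws_text = ("To<bookmark mark='5'/>navigate<bookmark mark='6'/>press"
                 "<bookmark mark='7'/>Up<bookmark mark='8'/>or"
                 "<bookmark mark='9'/>Down<bookmark mark='10'/>Arrow.")
    print(replace_bookmarks_with_spaces(jaws_text))
```

If the text already has a space next to a bookmark, this leaves two spaces in a row, which should be harmless for speech output.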
@gexgd0419 Thanks for the brilliant and eye-opening explanations. I really like V0.2's compatibility with NVDA and use it a lot for reading articles. Also looking forward to the next release for the JAWS fix though I myself don't use JFW much.
@gexgd0419 Thank you very much for your work. Now NVDA reads everything perfectly.
attempting to read via JAWS - either continuously or line by line - will result in many words being glued to one another
A new version v0.2.1 has been released, which should fix this issue.