Long text strings produce incomplete audio files
I'm trying to use edge-tts to convert a chapter of a book into an audiobook. It's about 39k characters and around 7500 words. When I run it through edge-tts, the resulting audio file is often incomplete. At what point in the text it cuts off seems to be inconsistent and arbitrary, and every now and then it successfully produces audio for the entire text.
Any idea what's going wrong? Is this even a use case that's expected to work? (I wonder if Microsoft is limiting how much audio it'll generate for one request.)
I think it's related to an issue I've started encountering a month ago where the service randomly stops responding with audio data. It's a problem I've observed in the Edge browser as well.
I'm not sure how best to work around this, but an obvious naive solution would be to retry a few times before accepting that the current split SSML chunk has no audio data. Right now, if any one of the split texts returns audio data, it doesn't raise an exception and considers the run a success.
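The naive retry idea described above could look roughly like the following. This is only a sketch, not the library's code: `synthesize` is a hypothetical async callable that opens its own connection, sends one chunk, and returns whatever audio bytes came back; the attempt count and delay are arbitrary. It also only catches the "no audio at all" case, not partial audio.

```python
import asyncio
from typing import Awaitable, Callable


async def synthesize_with_retry(
    synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical per-chunk TTS call
    chunk: str,
    attempts: int = 3,
    delay: float = 1.0,
) -> bytes:
    """Call `synthesize(chunk)` until it returns non-empty audio."""
    for attempt in range(1, attempts + 1):
        audio = await synthesize(chunk)
        if audio:
            return audio
        if attempt < attempts:
            # Linear back-off before trying again on a fresh call.
            await asyncio.sleep(delay * attempt)
    # Fail loudly instead of silently treating an empty chunk as success.
    raise RuntimeError(f"no audio received after {attempts} attempts")
```

Raising on the final failure is a deliberate choice here, so that a chunk without audio is never counted as a success.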
Do you have any luck with the latest release (6.1.10)?
Seems slightly better, but no luck. For context, I ran a generation 6 times on the a.txt file; there's about a megabyte or so missing in two of those files...
➜ edge-tts git:(master) ✗ wc a.txt
1738 40238 269540 a.txt
➜ edge-tts git:(master) ✗ ls -lh *.mp3
-rw-r--r-- 1 user user 87M Feb 17 00:02 a.mp3
-rw-r--r-- 1 user user 87M Feb 16 23:55 b.mp3
-rw-r--r-- 1 user user 85M Feb 17 00:03 c.mp3
-rw-r--r-- 1 user user 86M Feb 17 00:03 d.mp3
-rw-r--r-- 1 user user 87M Feb 17 00:02 e.mp3
-rw-r--r-- 1 user user 87M Feb 16 23:55 f.mp3
➜ edge-tts git:(master) ✗ ls -l *.mp3
-rw-r--r-- 1 user user 90403632 Feb 17 00:02 a.mp3
-rw-r--r-- 1 user user 90403632 Feb 16 23:55 b.mp3
-rw-r--r-- 1 user user 88253712 Feb 17 00:03 c.mp3
-rw-r--r-- 1 user user 89767152 Feb 17 00:03 d.mp3
-rw-r--r-- 1 user user 90403632 Feb 17 00:02 e.mp3
-rw-r--r-- 1 user user 90403632 Feb 16 23:55 f.mp3
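For reference, the shortfall can be computed directly from the byte counts in the listing above. This small sketch assumes the largest file is the complete one, which the listing alone does not prove.

```python
# Byte sizes copied from the `ls -l` output above.
sizes = {
    "a.mp3": 90403632,
    "b.mp3": 90403632,
    "c.mp3": 88253712,
    "d.mp3": 89767152,
    "e.mp3": 90403632,
    "f.mp3": 90403632,
}

# Assumption: the largest file represents a complete generation.
largest = max(sizes.values())
shortfall = {name: largest - size for name, size in sizes.items() if size < largest}

for name, missing in shortfall.items():
    print(f"{name}: {missing} bytes short ({missing / 2**20:.2f} MiB)")
```

By this measure c.mp3 is 2149920 bytes short and d.mp3 is 636480 bytes short.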
6.1.10 stops running halfway through. I tested it with a 100,000-word text file and got an MP3 containing only around 80,000 words. 6.1.9, however, could run through the entire process, yet the subtitles it generated capture very little of the text.
The text file contains 250,000 words, while both the MP3 and the subtitles consist of only around 80,000 words.
6.1.10:
I remember previously the data for generating the MP3 would incrementally increase until completion, but now it goes from 0 directly to completion. I'm not sure if this is the reason for the issue.
6.1.9:
@expwise Thanks for the info, I'll attempt a workaround in a bit. For the time being, I guess you'll need to stick to 6.1.9 as it works better somehow. It's worth mentioning that both have issues, it just seems like in your case 6.1.10 is worse....
My theory is that it has to do with the fact that 6.1.10 switches to the next chunk of ~64KiB text immediately without creating a new connection whereas 6.1.9 emulates the Microsoft Edge behavior of starting from a new connection.
@rany2 Thank you for your efforts. Your project has been of great help to me. Well done!
What's the status of this issue? When I first reported it I was using 6.1.9.
@briankendall It's more complicated than I expected, the issue is that sometimes their API returns audio output partially on the same connection. So I can't just have a check on whether the current connection returned any audio and if not, retry; it's more complicated....
@rany2 Understood! I hope you can figure out a method for working around this.
Maybe a workaround could be edge-tts to chunk text files into workable sizes, run them individually, then splice them back at the end?
@lefnire we're doing that already, I tried different chunk sizes and I'm having the same issues regardless :(
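For readers following along, the chunk-and-splice idea discussed above can be sketched as below. This is an illustration only and not how edge-tts splits text internally: `split_text` is a made-up helper, the 4096-byte limit is arbitrary, and splitting on whitespace collapses the original line breaks.

```python
def split_text(text: str, max_bytes: int = 4096) -> list[str]:
    """Split on whitespace so each chunk is at most `max_bytes` of UTF-8."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > max_bytes:
            raise ValueError("a single word exceeds max_bytes")
        # One extra byte for the joining space, except for the first word.
        extra = word_size + (1 if current else 0)
        if size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio parts joined in order, for example with `b"".join(parts)`, on the assumption that the parts share the same MP3 format.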
@rany2 aw bummer. Thanks for the reply. I just went through the gamut: tortoise-tts, coqui-ai/tts, bark, edge-tts. Edge was victorious; but for this one bug. Tortoise is unusably slow (but great realism). Coqui & Bark can't take large files, nor did I find their voices realistic. edge-tts shocked me in terms of realism and speed. Here's hoping there's a solution somehow! Huge bummer Edge browser doesn't support to-file, without weird hoops (recording audio-out overnight kinda deal).
@lefnire not sure what you mean by to-file but you could actually save the mp3: https://github.com/rany2/edge-tts/blob/e58af9da76c7c7ba101c955ee1c2e98ce424f58f/examples/basic_generation.py#L19
@rany2 right right, I meant it's a shame that Microsoft Edge Browser doesn't do this natively. Hence a big value-add of this project.
Hey People,
I am struggling with the same issue, also for audiobook generation like @lefnire. Earlier I was able to produce several books without problems; nowadays it's a huge struggle.
But, I might have found some partial solution.
In my case (Python 3.9 on Mac) I received errors with
asyncio.exceptions.TimeoutError
Mostly it either produced some audio (often not complete), or it gave that error after a few seconds. Therefore I upped the 'receive_timeout' in 'communicate.py' from 5 to 9000:
def __init__(
    self,
    text: str,
    voice: str = "Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)",
    *,
    rate: str = "+0%",
    volume: str = "+0%",
    pitch: str = "+0Hz",
    proxy: Optional[str] = None,
    receive_timeout: int = 9000,
):
This suppressed the above-mentioned error, but I still struggled with incomplete audio files....
I then looked into the 'aiohttp.ClientSession' documentation, and found that there is a default total timeout of 300 seconds (5 minutes).
My audio files were around 20 MB each when they stopped being produced, and that often took about 5 minutes. After some iteration, I changed this to 9000 seconds (150 minutes) as well:
# Create a new connection to the service.
ssl_ctx = ssl.create_default_context(cafile=certifi.where())
# By default aiohttp uses a total 300 seconds (5min) timeout,
# it means that the whole operation should finish in 5 minutes... (not long enough)
# ... therefore we extend this quite a lot.
timeout = aiohttp.ClientTimeout(total=9000)
async with aiohttp.ClientSession(
    timeout=timeout,
    trust_env=True,
) as session, session.ws_connect(
    f"{WSS_URL}&ConnectionId={connect_id()}",
Since then it seems to work much better again – but not perfect!
I still get incomplete files, but fewer. What I have observed for several files already: they were finalized as incomplete 10 minutes after the file was initially created. This could hint at an upper limit of 10 minutes on the connection, enforced server-side (though the timeout could still be client-side). @rany2 I don't understand the software well enough. Is there an easy way to close the session after maybe 5 minutes and continue the text with a new session afterwards?
Disclaimer: I tried other things, e.g. reducing the threshold for the "chopping of the texts" from websocket_max_size: int = 2**16 to websocket_max_size: int = 2**12... this could have an effect too, but I don't think so. (as @rany2 already tested this anyways).
I should also say that I don't really understand the technicalities, and that I picked the 9000 seconds more or less at random.
As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so it takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter now even though they didn't matter earlier.
@rany2 thanks a lot for your software. I really enjoy using it for my use case, and have already listened to audiobooks created with this tool for many hours.
@tschnibo Thanks for researching and your kind words, I didn't know ClientSession had a timeout and never actually faced any timeout errors so I don't think it's related to this issue specifically. I'll try to look into your points to see if they get me any closer to a resolution.
It seems like the timeout value for ClientSession is a timeout for the entire operation, which seems like something we wouldn't want in this context because a generation might take a very long time. I'll most likely disable it altogether and increase the receive_timeout to a minute.
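The difference between a timeout on the whole operation and a timeout on each individual receive can be illustrated without any network at all. The sketch below uses only asyncio; `producer` and `receive_all` are made-up stand-ins for a server that keeps sending messages and a client that reads them, not edge-tts or aiohttp code.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_messages: int, gap: float) -> None:
    """Stand-in for a server that sends one message every `gap` seconds."""
    for i in range(n_messages):
        await asyncio.sleep(gap)
        await queue.put(i)
    await queue.put(None)  # sentinel: end of stream


async def receive_all(n_messages: int, gap: float, receive_timeout: float) -> int:
    """Read until the sentinel, timing out only if a single receive stalls."""
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(producer(queue, n_messages, gap))
    received = 0
    try:
        while True:
            # The timeout applies to each receive, not to the whole stream.
            item = await asyncio.wait_for(queue.get(), timeout=receive_timeout)
            if item is None:
                return received
            received += 1
    finally:
        task.cancel()
```

With a per-receive timeout, a long generation is allowed to run for as long as data keeps arriving. Wrapping the whole `receive_all(...)` call in a single `asyncio.wait_for` instead would cut it off once the total budget is spent, which is the behavior a total session timeout imposes.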
> As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so it takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter now even though they didn't matter earlier.
Makes sense.
@rany2 Thank you for your friendly response!
To illustrate the unfinished files from yesterday, after applying these changes, it looked like this:
This time difference of 10 minutes between "created" and "last changed" seems like a pattern.
Disabling the ClientSession timeout seems like the right way to go, I totally agree. On the other hand, maybe one could define the timeout to be shorter than 10 minutes, catch the timeout and proactively create a new session, or something like that. But this async session handling and OOP is not something I easily see through, so I don't know what the easiest route would be. Maybe there is a way to just wait for the session to be terminated by the server, and then reconnect with a new session, but I don't know whether the server actively communicates this to the client.
Looking forward to watching the further development of this issue.
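The "catch the timeout and proactively create a new session" idea above might be sketched like this. It is a rough illustration under stated assumptions, not a tested fix: `synthesize` is a hypothetical async callable that opens a fresh connection for each call and returns the audio for one chunk, and the 300-second deadline is just a value below the observed 10 minutes.

```python
import asyncio
from typing import Awaitable, Callable, List


async def synthesize_with_deadline(
    synthesize: Callable[[str], Awaitable[bytes]],  # hypothetical, one connection per call
    chunks: List[str],
    deadline: float = 300.0,
    attempts: int = 3,
) -> bytes:
    """Give every chunk its own time budget and retry it on a timeout."""
    parts: List[bytes] = []
    for chunk in chunks:
        for _ in range(attempts):
            try:
                audio = await asyncio.wait_for(synthesize(chunk), timeout=deadline)
            except asyncio.TimeoutError:
                # Discard the stalled attempt; the next call starts over.
                continue
            parts.append(audio)
            break
        else:
            raise RuntimeError(f"chunk still timing out after {attempts} attempts")
    return b"".join(parts)
```

Retrying a whole chunk throws away any partial audio from the stalled attempt, which keeps the output consistent at the cost of repeating work.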
Can someone test if the version in master (not the one released in pypi) still has this issue?
Nevermind it's still inconsistent when it comes to this, but the first few runs were fine. I got my hopes up when it was working the first couple runs ):
tests/001-long-text_a.mp3 tests/001-long-text_b.mp3 differ: byte 781, line 3
tests/001-long-text_a.mp3 tests/001-long-text_g.mp3 differ: byte 27425332, line 110673
tests/001-long-text_a.srt tests/001-long-text_g.srt differ: byte 177684, line 5505
tests/001-long-text_a.mp3 tests/001-long-text_h.mp3 differ: byte 27425332, line 110673
tests/001-long-text_a.mp3 tests/001-long-text_m.mp3 differ: byte 781, line 3
tests/001-long-text_a.mp3 tests/001-long-text_r.mp3 differ: byte 781, line 3
tests/001-long-text_a.srt tests/001-long-text_r.srt differ: byte 175768, line 5441
tests/001-long-text_a.mp3 tests/001-long-text_z.mp3 differ: byte 781, line 3
tests/001-long-text_a.srt tests/001-long-text_z.srt differ: byte 87159, line 2693
@rany2 I really had to raise both of the timeouts much more, to be left with only this 10-minute timeout now. Have you tried with similarly excessive timeouts as I did?
Maybe the first runs were "not further throttled", and then some sort of abuse prevention on the server side is activated and slows the process further?
@tschnibo but there's no way that receive_timeout would need to be more than a minute? It's for sock recv... are you sure? The receive_timeout now controls the receive on the low-level socket, not the websocket.
@rany2 to be honest, I have no clue. With my timeout settings, the mp3 file is produced for 10 minutes and then it is finished incomplete. When I chose smaller timeout values for receive_timeout in the beginning, I still had this asyncio.exceptions.TimeoutError. But yes, because I changed both values, I am not 100% sure which one had which effect.
If you were able to reproduce this 10-minute phenomenon, maybe it would be indicative of some mechanism.
with one of my examples I looked at the submitted text, and the produced .vtt file, and also some of the websocket (I think), messages.
and it stopped somewhere in the middle of the submitted text, with returning messages:
{'type': 'WordBoundary', 'offset': 35077000000, 'duration': 1500000, 'text': 'that'}
{'type': 'WordBoundary', 'offset': 35078625000, 'duration': 1500000, 'text': 'have'}
{'type': 'WordBoundary', 'offset': 35080250000, 'duration': 6500000, 'text': 'significant'}
{'type': 'WordBoundary', 'offset': 35086875000, 'duration': 7750000, 'text': 'implications'}
{'type': 'WordBoundary', 'offset': 35094750000, 'duration': 1125000, 'text': 'for'}
... and then it starts with the next text, for the next file.
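Messages like the ones above can be used to see where a truncated run stopped. The sketch below is a hypothetical helper, not part of edge-tts; it assumes `offset` and `duration` are expressed in 100-nanosecond ticks, which should be checked against the library before relying on the computed seconds.

```python
from typing import Optional, Tuple

# Assumption: offsets and durations are in 100 ns units.
TICKS_PER_SECOND = 10_000_000


def last_spoken_word(events) -> Optional[Tuple[str, float]]:
    """Return the last WordBoundary text and the time (in seconds) at which it ends."""
    boundaries = [e for e in events if e.get("type") == "WordBoundary"]
    if not boundaries:
        return None
    last = max(boundaries, key=lambda e: e["offset"])
    end_ticks = last["offset"] + last["duration"]
    return last["text"], end_ticks / TICKS_PER_SECOND


# The five messages quoted above.
events = [
    {"type": "WordBoundary", "offset": 35077000000, "duration": 1500000, "text": "that"},
    {"type": "WordBoundary", "offset": 35078625000, "duration": 1500000, "text": "have"},
    {"type": "WordBoundary", "offset": 35080250000, "duration": 6500000, "text": "significant"},
    {"type": "WordBoundary", "offset": 35086875000, "duration": 7750000, "text": "implications"},
    {"type": "WordBoundary", "offset": 35094750000, "duration": 1125000, "text": "for"},
]
print(last_spoken_word(events))
```

Under that unit assumption, the last word here ("for") ends about 58.5 minutes into the audio. Searching the submitted text for the final few words then shows how much of it was never synthesized.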
I think it would be interesting to monitor the connection and see if there is some sort of termination message.
> see if there is some sort of termination message
There isn't unfortunately :(
> With my timeout settings, the mp3 file is produced for 10 minutes and then it is finished incomplete. When I chose smaller timeout values for receive_timeout in the beginning
Could you test the current version in master and see if you still get timeouts? The parameter now sets a timeout for socket recv, previously it was controlling the time it needs to get a websocket message response.
Yes, I'll try to test... I'm just doing this alongside a completely different job, so I can't plan when I'll get the testing done.
@rany2 To make my task easier, I patched my existing installation with your changes; I hope I have done this correctly. The first few files went flawlessly, but now that the chapters are maybe getting longer again (or some throttling is kicking in again), it has just shown this 10-minute cutoff again, with the processing left unfinished.
The next chapter went alright again (with a 34 MB audio generated, in 4 minutes), the next one cancelled after 10 minutes and 18 MB again... as did the next few chapters, until a much shorter chapter, which completed fine.
So for me the behavior seems to stay the same as with my extended timeouts: the files are either generated correctly (and maybe rather quickly), or the process is slower and quits after 10 minutes for large texts, while it may still succeed for shorter texts.
I didn't have any timeout-errors like in the pypi version...
@tschnibo so you're saying that the defaults right now don't need any adjusting?