gTTS
0xA0 is causing gtts-cli to send EOF.
Prerequisites
- [*] Did you make sure a similar issue didn't exist?
- [*] Did you update gTTS to the latest? (pip install --upgrade gTTS)
Current Behaviour (steps to reproduce)
The presence of 0xA0 in the input text is mostly ignored by gtts-cli. But in certain situations (such as the provided example) it produces Error: 200 (OK) from TTS API. Probable cause: No audio stream in response. Unsupported language 'en' along with EOF (and the message seems to go to stderr without an actual Python error being raised).
$ gtts-cli -f test -o test.mp3
working_test.txt
non_working_test.txt
I assumed that containing 0xA0 would make the files binary, but the file command says the opposite.
$ file non_working_test.txt
non_working_test.txt: Unicode text, UTF-8 text
gtts-cli didn't complain about non-UTF-8 characters, and using iconv to remove non-UTF-8 characters doesn't change anything.
$ iconv -f utf-8 -t utf-8 -c test
does nothing to the file.
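A possible explanation for why file reports UTF-8 text and iconv -c drops nothing: if the character in question is the no-break space U+00A0, it is stored in a UTF-8 file as the two bytes C2 A0, which is perfectly valid UTF-8. The sketch below only illustrates that encoding fact; the sample strings are made up and are not the contents of the attached files.

```python
# U+00A0 (NO-BREAK SPACE) is the character that Windows-1252 and Latin-1
# store as the single byte 0xA0.
nbsp = "\u00a0"

# In UTF-8 the same character is the two-byte sequence C2 A0.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex(" "))  # c2 a0

# A made-up sample line (an assumption, not the real test file).
sample = "hello\u00a0world".encode("utf-8")

# Decoding succeeds, so tools that validate UTF-8 have nothing to reject.
decoded = sample.decode("utf-8")
offsets = [i for i, ch in enumerate(decoded) if ch == nbsp]
print(offsets)  # [5]

# A lone 0xA0 byte, by contrast, is not valid UTF-8 and fails to decode.
try:
    b"hello\xa0world".decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```

If this is what is in the file, the text is well-formed and the character has to be replaced explicitly rather than filtered out as an encoding error.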
Some web pages use that character between words, and most text editors show it as a space. This is a bit frustrating for the user (you have almost no clue what causes the error or what to do about it).
I can't blame the creator of the page, since it seems (after searching online) that 0xA0 is part of the windows-1252 encoding. So if they wrote their blog in Microsoft Word, there's a big chance it was introduced there.
Expected Behaviour
gtts-cli should ignore that character and continue reading regardless of how and where it is present.
Context
I am writing a simple bash script that reads aloud the user's clipboard or a webpage associated with the url in the user's clipboard.
I personally have been using the command w3m "$(xclip -o)" | gtts-cli -f - | mpv - for over a year to boost productivity when reading, with some variations such as less $pdf_file_or_epub_file | gtts-cli -f - | mpv - and so on.
The script basically does the same (Still very basic and under development).
I came across some web pages that caused this error to occur. After some investigation I found out that the character 0xA0 is what is causing the problem.
So I created this issue and made a small workaround that uses bbe to replace the bad character with nothing (and then iconv for cleanup, since the replacement messes up a couple of things).
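For anyone who would rather not depend on bbe, the same idea can be sketched as a small Python filter. This is only an illustration under the assumption that the input is UTF-8 and the offending character is U+00A0. The name replace_nbsp is hypothetical and is not part of gTTS or gtts-cli.

```python
import io

# UTF-8 encoding of U+00A0 (NO-BREAK SPACE): the bytes C2 A0.
NBSP_UTF8 = "\u00a0".encode("utf-8")


def replace_nbsp(src, dst, replacement=b" "):
    """Copy the binary stream src to dst, replacing UTF-8 no-break spaces.

    Hypothetical helper for illustration. Replacing at the byte level is
    safe here because C2 is only ever a lead byte in UTF-8, so the pair
    C2 A0 cannot occur inside another character.
    """
    data = src.read()
    dst.write(data.replace(NBSP_UTF8, replacement))


# Demonstration on in-memory streams with made-up input.
demo_in = io.BytesIO("one\u00a0two".encode("utf-8"))
demo_out = io.BytesIO()
replace_nbsp(demo_in, demo_out)
print(demo_out.getvalue())  # b'one two'
```

To use it in a pipeline, the function could be called as replace_nbsp(sys.stdin.buffer, sys.stdout.buffer) from a script placed between w3m and gtts-cli -f -. The sketch substitutes an ordinary space instead of deleting the character, so neighbouring words stay separated.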
Environment
$ gtts-cli --version
gtts-cli, version 2.2.4
$ python --version
Python 3.9.12
$ uname -a
Linux Laptop 5.17.3-tkg-pds #1 TKG SMP PREEMPT Sat Apr 16 06:53:55 CET 2022 x86_64 Intel(R) Celeron(R) N4000 CPU @ 1.10GHz GenuineIntel GNU/Linux
- OS: Gentoo/Linux x86_64
I assume this isn't gtts-cli's fault, since there's no actual Python error. So I assume the problem is actually with the Google text-to-speech engine. Yet the behavior itself is confusing, so I hope a fix will be applied.
@medanisjbara Thanks a lot for this well documented behaviour!
Hmm, so it's a windows-1252 character. I wonder if there's anything gTTS should (or shouldn't) do about this, like applying some filtering. I'll have to take a look with debugging on.