jigasi
Encoding problem: Lack of UTF-8 Support for JSON POST Requests in Transcription Module
Description
Transcribed text in languages other than English (for example, Hindi) arrives as '?' characters at the endpoints configured in SEND_JSON_REMOTE_URLS of the Jigasi transcription module. The content must be sent and received using the UTF-8 character encoding so that non-ASCII characters are not misinterpreted.
Current behavior
Whenever I speak in Hindi, the non-ASCII characters are not preserved and '?' is posted to my streams instead.
When the JSON data is sent to the server, the character encoding is not explicitly set. By default, it may use the system's default character encoding, which might not be UTF-8.
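To illustrate the suspected cause, here is a minimal sketch (not Jigasi code) of what happens when Hindi text is encoded with a charset that cannot represent it. The class name `LossyEncodingDemo` and its methods are made up for this example, and ISO-8859-1 only stands in for whatever non-UTF-8 default a system might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {
    // Encode with the given charset, then decode with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "namaste" in Devanagari, written with escapes so the result does
        // not depend on the source file encoding.
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // Lossy: every Devanagari character becomes '?'.
        System.out.println(roundTrip(hindi, StandardCharsets.ISO_8859_1));

        // Lossless: UTF-8 can represent the text, so it survives intact.
        System.out.println(roundTrip(hindi, StandardCharsets.UTF_8).equals(hindi));
    }
}
```

This matches the symptom described above: once the text has been encoded with an unsuitable charset, the original characters cannot be recovered on the receiving side.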
Expected Behavior
The transcription text should correctly represent the spoken content in any supported language without encoding issues.
Possible Solution
I've created a pull request that addresses this issue by ensuring the Content-Type header for JSON POST requests is explicitly set to application/json; charset=UTF-8. Additionally, I've ensured that the JSON string is converted to bytes using UTF-8 encoding before sending.
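The approach can be sketched roughly as follows. This is not the code from the pull request: it is a standalone illustration using `java.net.HttpURLConnection`, and the class name `Utf8JsonPost` and its methods are hypothetical. Jigasi itself may use a different HTTP client.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class Utf8JsonPost {
    // Header value that tells the receiver how the body bytes are encoded.
    static final String CONTENT_TYPE = "application/json; charset=UTF-8";

    // Convert the JSON string to bytes explicitly as UTF-8 instead of
    // relying on the platform default charset.
    static byte[] encodeBody(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // POST the JSON to the given address and return the HTTP status code.
    static int post(String address, String json) throws IOException {
        byte[] body = encodeBody(json);
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", CONTENT_TYPE);
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Both parts matter: the explicit UTF-8 conversion keeps the characters intact in the request body, and the `charset=UTF-8` parameter lets the receiving server decode them correctly.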
PR Link: https://github.com/jitsi/jigasi/pull/504
Steps to reproduce
- Set up Jigasi with transcription service.
- Use the transcription feature with a non-ASCII language, e.g., Hindi.
- Observe the returned transcription text containing unexpected characters or question marks.
- Configuration used: org.jitsi.jigasi.transcription.SEND_JSON_REMOTE_URLS=<remote json accepting url>