ESP32-audioI2S

Can you help me write the code to call Doubao TTS?

Explorerlowi opened this issue 1 year ago • 4 comments

My programming skills are too weak to do this on my own (灬ꈍ ꈍ灬). Here is the relevant documentation: https://www.volcengine.com/docs/6561/79823
I am willing to pay you for your help. Thank you very much if you can do it!

Explorerlowi, Sep 26 '24 16:09

The response to a Doubao TTS request is not a raw binary audio stream. The audio is stored in the "data" field of a JSON object as base64-encoded text, so the binary audio can only be obtained after base64 decoding. How can I play it in this case? (screenshot)

Explorerlowi, Oct 07 '24 19:10
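For the JSON/base64 response described above, here is a minimal sketch of the decoding step. It assumes the complete response body is already in memory. It pulls the "data" string out of the JSON with a naive search (not a real JSON parser; it assumes "data" holds a plain string with no escaped quotes) and then base64-decodes it. The names `extractDataField` and `base64Decode` are hypothetical and are not part of ESP32-audioI2S. On the ESP32, mbedTLS's `mbedtls_base64_decode()` could replace the hand-written decoder. Buffering the whole body only suits short texts, because the base64 text is about 4/3 the size of the MP3 and all of it must fit in RAM.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Map one base64 character to its 6-bit value, or -1 if it is not in the
// standard alphabet (A-Z, a-z, 0-9, '+', '/').
static int b64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode base64 text to raw bytes. Characters outside the alphabet
// ('=' padding, the backslash of a JSON escape, whitespace) are skipped.
std::vector<uint8_t> base64Decode(const std::string &in) {
    std::vector<uint8_t> out;
    out.reserve(in.size() * 3 / 4);
    uint32_t acc = 0;  // bit accumulator (only the low bits matter)
    int bits = 0;      // number of not-yet-emitted bits in acc
    for (char c : in) {
        int v = b64Value(c);
        if (v < 0) continue;
        acc = (acc << 6) | static_cast<uint32_t>(v);
        bits += 6;
        if (bits >= 8) {  // a full byte is available
            bits -= 8;
            out.push_back(static_cast<uint8_t>(acc >> bits));
        }
    }
    return out;
}

// Naive extraction of the "data" string value from a JSON body.
// Assumption: "data" holds a plain string without escaped quotes.
std::string extractDataField(const std::string &json) {
    const std::string key = "\"data\"";
    size_t pos = json.find(key);
    if (pos == std::string::npos) return "";
    pos = json.find(':', pos + key.size());
    if (pos == std::string::npos) return "";
    size_t start = json.find('"', pos + 1);
    if (start == std::string::npos) return "";
    size_t end = json.find('"', start + 1);
    if (end == std::string::npos) return "";
    return json.substr(start + 1, end - start - 1);
}
```

The decoded bytes are the MP3 data, which would then have to be handed to the audio decoder. The streaming approach discussed next avoids holding the whole body in memory.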

```cpp
bool Audio2::connectToDoubaoTTS(const char *text) {
    xSemaphoreTakeRecursive(mutex_audio, portMAX_DELAY);

    setDefaults();

    const char *host = "openspeech.bytedance.com";
    const char *api_url = "/api/v1/tts";

    const char *appid = "82505*****";
    const char *access_token = "WZCBgLbSd-ltw5gDeKvEYX9M******";
    const char *cluster = "volcano_tts";
    const char *voice_type = "BV001_streaming";

    // Create JSON request (ArduinoJson 6 API)
    DynamicJsonDocument doc(1024); // Adjust size as necessary (long texts need more)
    JsonObject app = doc.createNestedObject("app");
    app["appid"] = appid;
    app["token"] = access_token;
    app["cluster"] = cluster;

    JsonObject user = doc.createNestedObject("user");
    user["uid"] = "388808087185088";

    JsonObject audio = doc.createNestedObject("audio");
    audio["voice_type"] = voice_type;
    audio["encoding"] = "mp3";
    audio["speed_ratio"] = 1.0;
    audio["volume_ratio"] = 1.0;
    audio["pitch_ratio"] = 1.0;

    JsonObject request = doc.createNestedObject("request");
    request["reqid"] = String(uuid()); // Generate a random request ID
    request["text"] = text;
    request["text_type"] = "plain";
    request["operation"] = "query";
    request["with_frontend"] = 1;
    request["frontend_type"] = "unitTson";

    // Prepare JSON payload
    String json_payload;
    serializeJson(doc, json_payload);

    // Connect to the server
    _client = static_cast<WiFiClientSecure *>(&clientsecure);
    if (!_client->connect(host, 443)) { // Use 443 for HTTPS
        log_e("Connection failed");
        xSemaphoreGiveRecursive(mutex_audio);
        return false;
    }

    // Create and send HTTP POST request
    _client->println("POST " + String(api_url) + " HTTP/1.1");
    _client->println("Host: " + String(host));
    _client->println("Authorization: Bearer; " + String(access_token));
    _client->println("Content-Type: application/json");
    _client->println("Content-Length: " + String(json_payload.length())); // length in bytes
    _client->println(); // End of headers

    // Send JSON payload
    _client->print(json_payload);

    Serial.println(json_payload);
    // Read the response (alternative approach, currently disabled)
    /*String response = "";
    while (_client->connected() || _client->available()) {
        if (_client->available()) {
            char c = _client->read();
            response += c;
        }
    }

    // Process the response
    // Note: `response` also contains the status line and headers; only the
    // body after "\r\n\r\n" is JSON. A 1024-byte document is too small for
    // base64-encoded audio of all but the shortest clips.
    if (response.indexOf("\"data\"") != -1) {
        // Parse the JSON response to get the data
        DynamicJsonDocument responseDoc(1024); // Adjust size as needed
        deserializeJson(responseDoc, response);
        const char* data = responseDoc["data"];
        // Here you would base64 decode the data and handle the audio
        // Remember to consider the necessary libraries or methods to handle audio output
    } else {
        log_e("No data in response");
    }

    _client->stop();*/

    // Hand the open connection to the library's stream loop. The body is JSON
    // with base64 audio, not raw MP3, so it must be decoded before playback.
    m_streamType = ST_WEBFILE;
    Serial.print("play speech: ");
    Serial.println(m_streamType);
    isplaying = 1;
    m_f_running = true;
    m_f_ssl = true; // the request went over clientsecure (HTTPS), not plain HTTP
    m_f_tts = true;
    setDatamode(HTTP_RESPONSE_HEADER);
    xSemaphoreGiveRecursive(mutex_audio);
    return true;
}

// Method to generate a request ID (simple implementation, not a real UUID)
String Audio2::uuid() {
    uint32_t uid = esp_random(); // Random number as a placeholder for UUID generation
    return String(uid, HEX);
}
```

This is my current code.

Explorerlowi, Oct 07 '24 19:10

With a "normal" audio stream, the data is written to the buffer at this point (see screenshot). InBuff.getWritePtr() is the pointer to the position where the data is written, and bytesAddedToBuffer contains the number of bytes actually written. The base64 conversion would have to be done here.

You don't need to worry about the rest: if it is an MP3 stream, for example, the ID3 header is read automatically once the buffer is full enough, and then the file is played.

schreibfaul1, Oct 07 '24 20:10
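Following the suggestion above, the decoding can happen right where the bytes land in the buffer. The sketch below is minimal and assumes two things: the HTTP headers have already been consumed, and the body is not chunk-encoded (hex chunk-size lines would be misread as base64). The hypothetical class `DoubaoDataDecoder`, which is not part of the library, scans each newly read chunk for the "data" key. It then base64-decodes the string value in place and returns the number of audio bytes now at the start of the chunk. Its state survives between calls, so a key or a 4-character base64 group can be split across reads. Base64 yields 3 bytes per 4 characters, so the output never overtakes the input, and that is what makes in-place decoding safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The JSON key to look for, including its quotes.
constexpr char kDataKey[] = "\"data\"";
constexpr size_t kDataKeyLen = sizeof(kDataKey) - 1;

// Hypothetical helper, not part of ESP32-audioI2S: extracts the "data"
// string from a JSON body that arrives in arbitrary chunks and decodes its
// base64 content in place. State persists between feed() calls.
class DoubaoDataDecoder {
public:
    // Process buf[0..len). Decoded audio bytes are written back to the start
    // of buf; the return value is how many of them there are.
    size_t feed(uint8_t *buf, size_t len) {
        size_t out = 0;
        for (size_t i = 0; i < len; ++i) {
            const char c = static_cast<char>(buf[i]);
            switch (state_) {
            case State::Key:            // searching for "data" (with quotes)
                if (c == kDataKey[matched_]) {
                    if (++matched_ == kDataKeyLen) { state_ = State::Colon; matched_ = 0; }
                } else {
                    matched_ = (c == '"') ? 1 : 0;  // a quote may start a new match
                }
                break;
            case State::Colon:          // a key is followed by ':'
                if (c == ':') state_ = State::Quote;
                else if (!isSpace(c)) state_ = State::Key;  // it was a value, keep looking
                break;
            case State::Quote:          // opening quote of the string value
                if (c == '"') state_ = State::Value;
                else if (!isSpace(c)) state_ = State::Key;
                break;
            case State::Value: {        // base64 text up to the closing quote
                if (c == '"') { state_ = State::Done; break; }
                const int v = b64Value(c);
                if (v < 0) break;       // skip '=' padding and JSON escape backslashes
                acc_ = (acc_ << 6) | static_cast<uint32_t>(v);
                bits_ += 6;
                if (bits_ >= 8) {       // a full byte is ready
                    bits_ -= 8;
                    buf[out++] = static_cast<uint8_t>(acc_ >> bits_);
                }
                break;
            }
            case State::Done:           // ignore the rest of the JSON
                break;
            }
        }
        return out;
    }

    bool done() const { return state_ == State::Done; }

private:
    enum class State { Key, Colon, Quote, Value, Done };

    static bool isSpace(char c) { return c == ' ' || c == '\t' || c == '\r' || c == '\n'; }

    // 6-bit value of a standard base64 character, or -1 if not in the alphabet.
    static int b64Value(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    }

    State state_ = State::Key;
    size_t matched_ = 0;   // characters of kDataKey matched so far
    uint32_t acc_ = 0;     // base64 bit accumulator
    int bits_ = 0;         // pending bits in acc_
};
```

Hooking it in would presumably mean calling `feed(InBuff.getWritePtr(), bytesAddedToBuffer)` and reporting the returned count in place of `bytesAddedToBuffer`. The exact spot depends on the code in the screenshot, which is not reproduced here.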

Can you explain how the returned audio stream is parsed and played after an HTTP(S) request is sent to a TTS service? For example, which functions run once the response arrives, and how is the returned content processed? Currently I can only play Baidu TTS. When I send a request to Doubao TTS, Ali TTS, etc., I cannot parse and play the returned content correctly.

Explorerlowi, Oct 11 '24 15:10
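One way to approach the question above starts from the poster's code, which calls `setDatamode(HTTP_RESPONSE_HEADER)`: the library first reads the HTTP response header and then broadly treats the body as audio. That works when the body is raw audio, as is presumably the case with Baidu TTS, but Doubao wraps the audio in JSON. Below is a minimal sketch of how a client could tell the two cases apart from the header. It uses hypothetical names (`parseResponseHeader`, `ResponseInfo`, `BodyKind`) and assumes the service labels raw audio with an `audio/...` Content-Type and JSON with `application/json`. It also reports chunked transfer encoding, since chunk framing would have to be removed before any base64 decoding.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Kinds of response body this sketch distinguishes.
enum class BodyKind { RawAudio, Json, Other };

struct ResponseInfo {
    int status = 0;                  // HTTP status code from the status line
    BodyKind kind = BodyKind::Other; // derived from Content-Type
    bool chunked = false;            // Transfer-Encoding: chunked
};

static std::string toLower(std::string s) {
    for (char &c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Parse a raw header block such as
// "HTTP/1.1 200 OK\r\nContent-Type: audio/mp3\r\n\r\n".
ResponseInfo parseResponseHeader(const std::string &raw) {
    ResponseInfo info;
    size_t lineStart = 0;
    bool first = true;
    while (lineStart < raw.size()) {
        size_t lineEnd = raw.find("\r\n", lineStart);
        if (lineEnd == std::string::npos) lineEnd = raw.size();
        const std::string line = raw.substr(lineStart, lineEnd - lineStart);
        lineStart = lineEnd + 2;
        if (line.empty()) break;              // blank line ends the header
        if (first) {                          // status line, e.g. "HTTP/1.1 200 OK"
            first = false;
            const size_t sp = line.find(' ');
            if (sp != std::string::npos) info.status = std::atoi(line.c_str() + sp + 1);
            continue;
        }
        const size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        const std::string name = toLower(line.substr(0, colon));
        const std::string value = toLower(line.substr(colon + 1));
        if (name == "content-type") {
            if (value.find("audio/") != std::string::npos) info.kind = BodyKind::RawAudio;
            else if (value.find("application/json") != std::string::npos) info.kind = BodyKind::Json;
        } else if (name == "transfer-encoding" && value.find("chunked") != std::string::npos) {
            info.chunked = true;
        }
    }
    return info;
}
```

With `BodyKind::RawAudio` the body can go straight to the buffer. With `BodyKind::Json` it would first have to pass through a base64 extraction step like the ones sketched earlier in this thread.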

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Nov 11 '24 02:11

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Nov 25 '24 02:11