media-streams
Converting a Twilio live media stream (WebSocket) to the audio format supported by Microsoft Azure using NodeJs
Hello, I am trying to integrate the Twilio live media stream with Microsoft Azure STT in order to get a live transcription of the user's input. My problem at the moment is that I am unable to convert the payload to the WAV/PCM format supported by Azure. I saw a similar solution on this topic here (https://www.twilio.com/blog/live-transcription-media-streams-azure-cognitive-services-java), but it uses the Java programming language, while I am trying to do this with NodeJs. Can you please help?
Below is the code I am using:
const WebSocket = require("ws")
const express = require("express")
const app = express();
const server = require("http").createServer(app)
const path = require("path")
const base64 = require("js-base64");
const alawmulaw = require('alawmulaw');
const wss = new WebSocket.Server({ server })
//Include Azure Speech service
const sdk = require("microsoft-cognitiveservices-speech-sdk")
const subscriptionKey = '2195XXXXXXXXXXXXXXXXXX'
const serviceRegion = 'southeastasia'
// Hard code the variables
//const variables = require("./config/variables")
const language = "en-US"
const azurePusher = sdk.AudioInputStream.createPushStream(sdk.AudioStreamFormat.getWaveFormatPCM(8000, 16, 1))
const audioConfig = sdk.AudioConfig.fromStreamInput(azurePusher);
const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, serviceRegion);
speechConfig.speechRecognitionLanguage = language;
speechConfig.enableDictation();
let recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
recognizer.recognizing = (s, e) => {
console.log(`RECOGNIZING: Text=${e.result.text}`);
};
recognizer.recognized = (s, e) => {
if (e.result.reason == sdk.ResultReason.RecognizedSpeech) {
console.log(`RECOGNIZED: Text=${e.result.text}`);
}
else if (e.result.reason == sdk.ResultReason.NoMatch) {
console.log("NOMATCH: Speech could not be recognized.");
}
};
recognizer.canceled = (s, e) => {
console.log(`CANCELED: Reason=${e.reason}`);
if (e.reason == sdk.CancellationReason.Error) {
console.log(`CANCELED: ErrorCode=${e.errorCode}`);
console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
console.log("CANCELED: Did you update the key and location/region info?");
}
recognizer.stopContinuousRecognitionAsync();
};
recognizer.sessionStopped = (s, e) => {
console.log("\nSession stopped event.");
recognizer.stopContinuousRecognitionAsync();
};
recognizer.startContinuousRecognitionAsync(() => {
console.log("Continuous Reco Started");
},
err => {
console.trace("err - " + err);
recognizer.close();
recognizer = undefined;
});
// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
console.log("New Connection Initiated");
ws.on("message", function incoming(message) {
const msg = JSON.parse(message);
switch (msg.event) {
case "connected":
break;
case "start":
console.log(`Starting Media Stream ${msg.streamSid}`);
break;
case "media":
var streampayload = base64.decode(msg.media.payload)
var data = Buffer.from(streampayload)
var pcmdata = Buffer.from(alawmulaw.mulaw.decode(data))
//console.log(msg.mediaFormat.encoding)
// process.stdout.write(msg.media.payload + " " + " bytes\033[0G");
// streampayload = base64.decode(msg.media.payload, 'base64');
// let data = Buffer.from(streampayload);
azurePusher.write(pcmdata)
break;
case "stop":
console.log(`Call Has Ended`);
azurePusher.close()
recognizer.stopContinuousRecognitionAsync()
break;
}
});
})
app.post("/", (req, res) => {
res.set("Content-Type", "text/xml");
res.send(
`<Response>
<Say>
Leave a message
</Say>
<Start>
<Stream url="wss://${req.headers.host}" />
</Start>
<Pause length="60" />
</Response>`
)
});
server.listen(8080, () => console.log("Listening at Port 8080"));
Please help me convert the media payload, which arrives in mu-law format, to the PCM format supported by Microsoft Azure for speech-to-text transcription.
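Two byte-handling details in the `media` case above are worth double-checking, since either one silently corrupts the audio. A minimal sketch using only Node built-ins (the base64 payload and sample values here are made-up stand-ins, not real Twilio audio):

```javascript
// Pitfall 1: js-base64's decode() returns a *string* (interpreted as
// UTF-8), which corrupts binary audio data. Decode the media payload
// straight to a Buffer instead.
const payload = "f/9//w=="; // made-up base64 frame for illustration
const mulawBytes = Buffer.from(payload, "base64"); // raw 8-bit mu-law bytes

// Pitfall 2: Buffer.from(anInt16Array) copies one *byte per sample*
// (each value truncated to 8 bits). To keep the full 16-bit samples,
// wrap the typed array's underlying ArrayBuffer instead.
const samples = new Int16Array([32124, -32124, 0]); // stand-in for decoded PCM
const truncated = Buffer.from(samples);             // 3 bytes - wrong
const pcm = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength); // 6 bytes, little-endian on typical hardware

console.log(mulawBytes.length, truncated.length, pcm.length); // 4 3 6
```

With those two fixes, `alawmulaw.mulaw.decode()` can be fed real bytes and its `Int16Array` result written to the Azure push stream without truncation.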
I'm also facing the same problem:
- The transcription is accurate if I stream audio chunks from an audio file.
- Getting only a few random words if I stream audio chunks from Twilio calls.
I don't have much knowledge of low-level things, but here is what I notice: when I save Twilio's mu-law payload to a WAV file and try to play it, it plays perfectly. But when I send that file's audio chunks to Azure for continuous recognition, it doesn't work. When I then convert that WAV file to 16 kHz, 8-bit depth, mono through an external website and give it to Azure again, it works perfectly. So what I am trying to say is that something is going wrong in our conversion: it seems fine and working, but something is still missing.
Any solution?
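One reason a saved file can "play fine but fail in Azure" is the WAV header: a header may declare mu-law (format code 7) as the codec, so media players decode it happily, while the Azure push stream in the question is configured for 16-bit linear PCM. For sanity-checking a conversion locally, here is a minimal sketch (just the canonical 44-byte RIFF/WAVE header, not anything Azure-specific) that wraps raw 16-bit / 8 kHz / mono PCM so a player will interpret the bytes exactly the way Azure is being told to:

```javascript
// Wrap raw s16le 8 kHz mono PCM in a minimal 44-byte WAV header so the
// converted audio can be auditioned in any player before streaming it.
function pcmToWav(pcm /* Buffer of s16le samples */) {
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);  // RIFF chunk size = file size - 8
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt sub-chunk size
  header.writeUInt16LE(1, 20);               // audio format: 1 = linear PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(8000, 24);            // sample rate
  header.writeUInt32LE(8000 * 2, 28);        // byte rate = rate * block align
  header.writeUInt16LE(2, 32);               // block align = channels * 2 bytes
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);      // data sub-chunk size
  return Buffer.concat([header, pcm]);
}
```

If the file produced this way sounds like intelligible speech, the mu-law-to-PCM step is correct and the problem lies elsewhere; if it sounds like noise, the conversion itself is the culprit.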
I checked the Java example here (https://www.twilio.com/blog/live-transcription-media-streams-azure-cognitive-services-java), converted the MulawToPcm class to NodeJS, and started using it, and it's working for me:
/**
 * This class contains a single public method for mapping an array of 8-bit
 * µ-law values to 16-bit linear PCM values.
 *
 * This is needed because Twilio media-streams only produces µ-law encoded
 * audio data and some cloud speech-to-text engines only accept PCM.
 */
export class MulawToPcm {
    private static readonly mulawMapping: Int16Array = new Int16Array([
        32124, 31100, 30076, 29052, 28028, 27004, 25980, 24956,
        23932, 22908, 21884, 20860, 19836, 18812, 17788, 16764,
        15996, 15484, 14972, 14460, 13948, 13436, 12924, 12412,
        11900, 11388, 10876, 10364, 9852, 9340, 8828, 8316,
        7932, 7676, 7420, 7164, 6908, 6652, 6396, 6140,
        5884, 5628, 5372, 5116, 4860, 4604, 4348, 4092,
        3900, 3772, 3644, 3516, 3388, 3260, 3132, 3004,
        2876, 2748, 2620, 2492, 2364, 2236, 2108, 1980,
        1884, 1820, 1756, 1692, 1628, 1564, 1500, 1436,
        1372, 1308, 1244, 1180, 1116, 1052, 988, 924,
        876, 844, 812, 780, 748, 716, 684, 652,
        620, 588, 556, 524, 492, 460, 428, 396,
        372, 356, 340, 324, 308, 292, 276, 260,
        244, 228, 212, 196, 180, 164, 148, 132,
        120, 112, 104, 96, 88, 80, 72, 64,
        56, 48, 40, 32, 24, 16, 8, 0,
        -32124, -31100, -30076, -29052, -28028, -27004, -25980, -24956,
        -23932, -22908, -21884, -20860, -19836, -18812, -17788, -16764,
        -15996, -15484, -14972, -14460, -13948, -13436, -12924, -12412,
        -11900, -11388, -10876, -10364, -9852, -9340, -8828, -8316,
        -7932, -7676, -7420, -7164, -6908, -6652, -6396, -6140,
        -5884, -5628, -5372, -5116, -4860, -4604, -4348, -4092,
        -3900, -3772, -3644, -3516, -3388, -3260, -3132, -3004,
        -2876, -2748, -2620, -2492, -2364, -2236, -2108, -1980,
        -1884, -1820, -1756, -1692, -1628, -1564, -1500, -1436,
        -1372, -1308, -1244, -1180, -1116, -1052, -988, -924,
        -876, -844, -812, -780, -748, -716, -684, -652,
        -620, -588, -556, -524, -492, -460, -428, -396,
        -372, -356, -340, -324, -308, -292, -276, -260,
        -244, -228, -212, -196, -180, -164, -148, -132,
        -120, -112, -104, -96, -88, -80, -72, -64,
        -56, -48, -40, -32, -24, -16, -8, 0
    ]);

    /**
     * Converts a Buffer of µ-law encoded audio data to PCM.
     *
     * @param buffer Buffer of 8-bit µ-law values
     * @return Uint8Array of 16-bit PCM values. Each byte of µ-law converts
     *         to 2 bytes of PCM, so the output array is twice as long as
     *         the input. Pairs of PCM bytes are little-endian, i.e. the
     *         least-significant byte is the first in the pair.
     */
    public static transcode(buffer: Buffer): Uint8Array {
        const mulawBytes: Uint8Array = this.toArrayBuffer(buffer);
        const output = new Uint8Array(mulawBytes.length * 2);
        for (let i = 0; i < mulawBytes.length; i++) {
            // The Java original indexes with (signedByte + 128) because Java
            // bytes are signed; with unsigned Uint8Array values the
            // equivalent is (byte + 128) & 0xff, which wraps instead of
            // running off the end of the table.
            const pcmData: number = this.mulawMapping[(mulawBytes[i] + 128) & 0xff];
            // least-significant byte first
            output[2 * i] = pcmData & 0xff;
            // most-significant byte second
            output[2 * i + 1] = pcmData >> 8;
        }
        return output;
    }

    private static toArrayBuffer(buffer: Buffer): Uint8Array {
        const view = new Uint8Array(buffer.length);
        for (let i = 0; i < buffer.length; ++i) {
            view[i] = buffer[i];
        }
        return view;
    }
}
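For anyone who wants to verify a port like this without carrying the lookup table around, the same mapping can be computed from the standard G.711 µ-law expansion formula. The function below is a plain-JS stand-in with the same contract as `MulawToPcm.transcode` (µ-law `Buffer` in, little-endian 16-bit PCM `Uint8Array` out), so its output can be diffed against the table version on known code points:

```javascript
// Plain-JS stand-in matching MulawToPcm.transcode's contract. The
// per-sample math is the standard G.711 mu-law expansion (bias 0x84 = 132).
function transcode(mulaw) {
  const out = new Uint8Array(mulaw.length * 2);
  for (let i = 0; i < mulaw.length; i++) {
    const u = ~mulaw[i] & 0xff;           // mu-law bytes are stored complemented
    let t = ((u & 0x0f) << 3) + 0x84;     // 4-bit mantissa, re-biased
    t <<= (u & 0x70) >> 4;                // apply the 3-bit exponent
    const sample = (u & 0x80) ? 0x84 - t : t - 0x84; // top bit carries the sign
    out[2 * i] = sample & 0xff;           // least-significant byte first
    out[2 * i + 1] = (sample >> 8) & 0xff;
  }
  return out;
}

// Spot-check against known mu-law code points:
// 0x80 -> +32124 (positive full scale), 0xFF -> 0, 0x00 -> -32124.
const pcm = Buffer.from(transcode(Buffer.from([0x80, 0xff, 0x00])));
console.log(pcm.readInt16LE(0), pcm.readInt16LE(2), pcm.readInt16LE(4)); // 32124 0 -32124
```

If the formula version and the table version agree on all 256 input bytes, the port is faithful, and any remaining transcription trouble is in how the bytes reach the converter (base64 decoding, buffering) rather than in the µ-law math.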