nodejs-docs-samples

Speech client v2: my working example

Open sorokinvj opened this issue 1 year ago • 11 comments

Hey guys, I am not sure where to put this, but I just want to share my implementation of the Speech client v2 and some thoughts about migrating from v1 to v2. The official docs unfortunately provide only Python code, and there is not much info beyond this repo and the example I used for the migration: transcribeStreaming.v2.js

My SDK version is:

"@google-cloud/speech": "^6.0.1",

In my code I first initialize the service as:

const service = createGoogleService({ language, send })

and then call service.transcribeAudio(data) whenever new audio comes in from the frontend, which uses

const mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/webm;codecs=opus' }) // it's the default param
mediaRecorder.ondataavailable = (event: BlobEvent) => {
  // ... send the event.data to the backend
}

thus an audio chunk is just a browser Blob object.
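
For context, a minimal sketch of how the browser side could forward those chunks over a WebSocket; the socket URL and the 250 ms timeslice below are assumptions, not part of my actual setup:

const socket = new WebSocket('wss://example.com/transcribe'); // hypothetical endpoint

const mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/webm;codecs=opus' });
mediaRecorder.ondataavailable = (event: BlobEvent) => {
  if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
    socket.send(event.data); // arrives on the backend as a Buffer when using the ws library
  }
};
mediaRecorder.start(250); // emit a chunk roughly every 250 ms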

My service:

import { logger } from '../../logger';
import { getText, transformGoogleResponse } from './utils';
import { v2 as speech } from '@google-cloud/speech';
import { StreamingRecognizeResponse } from './google.types';
import { TranscriptionService } from '../transcription.types';
import { MachineEvent } from '../../websocket/websocket.types';
import { Sender } from 'xstate';
import { parseErrorMessage } from '../../../utils';
import { findRecognizerByLanguageCode } from './recognizers';

export const createGoogleService = ({
  language,
  send,
}: {
  language: string;
  send: Sender<MachineEvent>;
}): Promise<TranscriptionService> => {
  return new Promise((resolve, reject) => {
    try {
      const client = new speech.SpeechClient({
        keyFilename: 'assistant-demo.json',
      });

      const recognizer = findRecognizerByLanguageCode(language).name;

      const configRequest = {
        recognizer,
        streamingConfig: {
          config: {
            autoDecodingConfig: {},
          },
          streamingFeatures: {
            enableVoiceActivityEvents: true,
            interimResults: false,
          },
        },
      };

      logger.info('Creating Google service with recognizer:', recognizer);

      const recognizeStream = client
        ._streamingRecognize()
        .on('error', error => {
          logger.error('Error on "error" in recognizeStream', error);
          send({ type: 'ERROR', data: parseErrorMessage(error) });
        })
        .on('data', (data: StreamingRecognizeResponse) => {
          if (data.speechEventType === 'SPEECH_ACTIVITY_END') {
            send({ type: 'SPEECH_END', data: 'SPEECH_END' });
          }
          if (data.results.length > 0) {
            const transcription = transformGoogleResponse(data);
            if (transcription) {
              const transcriptionText = getText(transcription);
              if (!transcriptionText?.length) {
                // if the transcription is empty, do nothing
                return;
              }
              send({ type: 'NEW_TRANSCRIPTION', data: transcriptionText });
            }
          }
        })
        .on('end', () => {
          logger.warn('Google recognizeStream ended');
        });

      let configSent = false;

      const transcribeAudio = (audioData: Buffer) => {
        if (!configSent) {
          recognizeStream.write(configRequest);
          configSent = true;
        }
        recognizeStream.write({ audio: audioData });
      };

      const stop = () => {
        if (recognizeStream) {
          recognizeStream.end();
        }
      };

      resolve({ stop, transcribeAudio });
    } catch (error) {
      logger.error('Error creating Google service:', error);
      reject(error);
    }
  });
};
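
For completeness, a rough sketch of how the service is wired up on the backend; the ws server and handler below are illustrative, not my exact production code, and send is assumed to come from the xstate machine driving the session:

import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 }); // illustrative port

wss.on('connection', async socket => {
  // `send` comes from the xstate interpreter for this session (not shown here)
  const service = await createGoogleService({ language: 'en-US', send });

  socket.on('message', (data: Buffer) => {
    // each message is one MediaRecorder chunk from the frontend
    service.transcribeAudio(data);
  });

  socket.on('close', () => service.stop());
});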

Migration considerations

  1. To use v2 you need to create a recognizer; I did it with this function:
import { v2 } from '@google-cloud/speech';

/**
 * Creates a new recognizer.
 *
 * @param {string} projectId - The ID of the Google Cloud project.
 * @param {string} location - The location for the recognizer.
 * @param {string} recognizerId - The ID for the new recognizer.
 * @param {string} languageCode - The language code for the recognizer.
 * @returns {Promise<object>} The created recognizer.
 * @throws Will throw an error if the recognizer creation fails.
 */
export const createRecognizer = async (
  projectId: string,
  location: string,
  recognizerId: string,
  languageCode: string
) => {
  const client = new v2.SpeechClient({
    keyFilename: 'assistant-demo.json',
  });

  const request = {
    parent: `projects/${projectId}/locations/${location}`,
    recognizer: {
      languageCodes: [languageCode],
      model: 'latest_long',
      // Add any additional configuration here
    },
    recognizerId,
  };

  try {
    console.log('Creating recognizer...', request);
    const [operation] = await client.createRecognizer(request);
    const [recognizer] = await operation.promise();
    return recognizer;
  } catch (error) {
    console.error('Failed to create recognizer:', error);
    throw error;
  }
};
  2. The config object now has to be sent as the first piece of data on the stream, immediately before any audio. So if you previously did recognizingClient.write(audioData), you now need to do recognizingClient.write(newConfigWithRecognizer) (but only once!) and then recognizingClient.write({ audio: audioData }) <<< notice the object notation. See the sketch after this list.
  3. The config object itself has changed to:
public streamingConfig?: (google.cloud.speech.v2.IStreamingRecognitionConfig|null);

/** Properties of a StreamingRecognitionConfig. */
interface IStreamingRecognitionConfig {

  /** StreamingRecognitionConfig config */
  config?: (google.cloud.speech.v2.IRecognitionConfig|null);

  /** StreamingRecognitionConfig configMask */
  configMask?: (google.protobuf.IFieldMask|null);

  /** StreamingRecognitionConfig streamingFeatures */
  streamingFeatures?: (google.cloud.speech.v2.IStreamingRecognitionFeatures|null);
}
  4. When instantiating the streaming client, use _streamingRecognize() (this is likely to change).
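
Putting points 1, 2 and 4 together, a minimal sketch of the whole sequence (the project ID, location, recognizer ID and audioChunk variable are placeholders, and client is a v2 SpeechClient created as above):

// 1. Create the recognizer once, ahead of time (placeholder IDs).
const recognizer = await createRecognizer('my-project', 'global', 'my-recognizer-en', 'en-US');

// 4. Open the stream via the (currently private) _streamingRecognize().
const stream = client._streamingRecognize();

// 2. The very first write is the config with the recognizer name, exactly once...
stream.write({
  recognizer: recognizer.name,
  streamingConfig: {
    config: { autoDecodingConfig: {} },
    streamingFeatures: { interimResults: false },
  },
});

// ...and every write after that is audio, using the object notation.
stream.write({ audio: audioChunk });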

sorokinvj avatar Nov 28 '23 15:11 sorokinvj

Does this still work for you?

I created my recognizer in the Cloud Console and am just specifying it as a string. I'm, however, getting the error "Error: 3 INVALID_ARGUMENT: Malordered Data Received. Expected audio but none was set. Send exactly one config, followed by audio data.".

I really can't figure out what I'm doing wrong...

Edit: @sorokinvj I fixed it so the audio data gets sent, but I get no response back from STT. The stream ends up timing out. I have interimResults set to true.

adambeer avatar Jan 15 '24 18:01 adambeer

Yeah, all of this is likely because you are somehow sending the wrong audio data; if you share your code I might get more insight. @adambeer sorry, I only saw this just now. My code runs on a production server, no problems so far.

sorokinvj avatar Mar 23 '24 17:03 sorokinvj

I get code: 7, details: 'The caller does not have permission', metadata: Metadata { internalRepr: Map(0) {}, options: {} }

when I execute recognizeStream.write(streamingRecognizeRequest);

This is my code:

const config: IRecognitionConfig = {
  languageCodes: ['de-DE'],
  model: 'chirp',
  autoDecodingConfig: {},
};

const streamingRecognitionConfig: google.cloud.speech.v2.IStreamingRecognitionConfig = {
  config: config,
  streamingFeatures: {
    interimResults: true,
  },
};

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = {
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  streamingConfig: streamingRecognitionConfig,
};

const recognizeStream = client
  ._streamingRecognize()
  .on('error', (err) => {
    console.error(err);
  })
  .on('data', async (data) => {
    console.log('Data:', data);
  });

recognizeStream.write(streamingRecognizeRequest);

for (let i = 0; i < buffer.length; i += 1024) {
  // to byte array
  const data = Uint8Array.from(buffer.slice(i, i + 1024));
  recognizeStream.write({ audioContent: data });
}

recognizeStream.end();

While a non-streaming (batch) request works well:

const recognitionOutputConfig: IRecognitionOutputConfig = {
  gcsOutputConfig: {
    uri: `${STAGING_BUCKET_URL}/transcriptions/`,
  },
};

const request: RecognizeBatchRequest = {
  processingStrategy: 'DYNAMIC_BATCHING',
  toJSON(): { [p: string]: any } {
    return {};
  },
  recognitionOutputConfig,
  recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`,
  config: config,
  // content: buffer,
  files: [
    {
      uri: `${STAGING_BUCKET_URL}/audio_recordings/${audioRecordingId}`,
    },
  ],
};

Any idea why? The service account I use has Cloud Speech admin permissions. Is it maybe because of my location, asia-southeast1? When I change the model to "short" or "long" I get 'The language "auto" is not supported by the model "short" in the location named "asia-southeast1".'. This happens for any other language too.

PS: sorry for the non-code formatting.

MilanHofmann avatar Mar 29 '24 14:03 MilanHofmann

I am getting Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request. I can't track it down. Can you please help? Here are the relevant parts of the code:

import { v2 as speech } from '@google-cloud/speech';

const speechClient = new speech.SpeechClient();

const request = {
  recognizer: 'projects/redacted/locations/us-central1/recognizers/redacted',
  streamingConfig: {
    config: {
      languageCode: 'en-US',
    },
    streamingFeatures: {
      enableVoiceActivityEvents: true,
      interimResults: false,
    },
  },
};

const recognizeStream = speechClient.streamingRecognize(request)

NOTE: I am on node 18.

carstarai avatar Apr 30 '24 17:04 carstarai

I don't know, man, sorry. Your code is doing something different from mine, but to understand your use case I would need to see more than what you've shared.

  1. You are using .streamingRecognize, vs. _streamingRecognize in my code. I am not sure I understand what your function is doing.
  2. Moreover, your setup looks like you want to instantiate a stream by calling streamingRecognize once? For streaming you need to keep writing chunks of data continuously, which you are missing (see the sketch below).
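
Roughly, the write sequence should look like this (configRequest and audioChunks here are placeholders for your own values):

const stream = speechClient._streamingRecognize();

// the config goes in first, exactly once...
stream.write(configRequest);

// ...then keep writing audio chunks as they arrive
for (const chunk of audioChunks) {
  stream.write({ audio: chunk });
}

// end the stream only when the caller is done sending audio
stream.end();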

sorokinvj avatar May 01 '24 13:05 sorokinvj

const config: IRecognitionConfig = { languageCodes: ['de-DE'], model: 'chirp', autoDecodingConfig: {}, };

Can you try to not use any model and leave this field undefined?
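
i.e. something along these lines, just to illustrate the suggestion (not tested on my side):

const config: IRecognitionConfig = {
  languageCodes: ['de-DE'],
  autoDecodingConfig: {},
  // no model field, so the recognizer's default model applies
};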

const streamingRecognizeRequest: google.cloud.speech.v2.IStreamingRecognizeRequest = { recognizer: `projects/${process.env.FIREBASE_PROJECT_ID}/locations/asia-southeast1/recognizers/_`, streamingConfig: streamingRecognitionConfig, };

Ensure FIREBASE_PROJECT_ID is set.
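
For example, a quick guard just to rule this out (hypothetical, place it before building the request):

if (!process.env.FIREBASE_PROJECT_ID) {
  throw new Error('FIREBASE_PROJECT_ID is not set');
}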

recognizeStream.write(streamingRecognizeRequest)

It might be useful to have a basic check in the code that this request is sent only once, in the very first packet, though I doubt that has anything to do with a permission issue. Sorry, I have no idea why you might be stuck on this.
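
Something like the guard in my service above, adapted to your names (writeAudio here is just an illustrative name):

let configSent = false;

const writeAudio = (chunk: Buffer) => {
  if (!configSent) {
    recognizeStream.write(streamingRecognizeRequest); // config first, exactly once
    configSent = true;
  }
  recognizeStream.write({ audio: chunk });
};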

sorokinvj avatar May 01 '24 13:05 sorokinvj

Thank you for your time. I got the config set correctly. I was also able to get a stream from a local file to transcribe correctly. However, I am getting no response on this:

const parsedMessage = JSON.parse(decodedMessage);

if (parsedMessage.event === "media" && parsedMessage.media) {
  const decodedPayload = Buffer.from(parsedMessage.media.payload, 'base64');
  recognizeStream.write({ audio: decodedPayload });
}

This is a base64-encoded payload coming from a websocket message. The config sends properly, but there is clearly some sort of problem with the buffer and audio data. I verified the incoming data and that the data is sent to _streamingRecognize. These lines worked in v1, by the way.

No error codes, just no response, then a timeout after cancelling the project.

UPDATE: I just realized it only gives a response on stream end. I think this is my problem; this needs to transcribe voice data via a websocket.

carstarai avatar May 02 '24 19:05 carstarai

@sorokinvj I'm calling it from a Google Cloud Function onCall handler with the code below, and I'm receiving Unhandled error Error: 3 INVALID_ARGUMENT: Invalid resource field value in the request.

I tried without the model, but I still get the same problem.

import * as speech from '@google-cloud/speech';

const client = new speech.v2.SpeechClient();

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
  languageCodes: ['en-US'],
  features: {
    profanityFilter: true,
    enableAutomaticPunctuation: true,
    enableSpokenEmojis: true,
  },
  autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
  content: request.data.audio,
  config: config,
};

const [response] = await client.recognize(speechRequest);

rodrifmed avatar May 02 '24 21:05 rodrifmed

UPDATE: I could make it work with:

const client = new speech.v2.SpeechClient({
  apiEndpoint: `us-central1-speech.googleapis.com`,
});

const config: speech.protos.google.cloud.speech.v2.IRecognitionConfig = {
  languageCodes: ["en-US"],
  model: "long",
  features: {
    profanityFilter: true,
    enableAutomaticPunctuation: true,
    enableSpokenEmojis: true,
  },
  autoDecodingConfig: {},
};

const speechRequest: speech.protos.google.cloud.speech.v2.IRecognizeRequest = {
  recognizer: "projects/{{FIREBASE_PROJECT_ID}}/locations/us-central1/recognizers/_",
  content: request.data.audio,
  config: config,
};
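
And then the call itself stays the same as in my first snippet; reading the transcript out of the v2 response looks roughly like this (the extraction is just a sketch):

const [response] = await client.recognize(speechRequest);

const transcript = response.results
  ?.map(result => result.alternatives?.[0]?.transcript)
  .join(' ');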

rodrifmed avatar May 03 '24 15:05 rodrifmed

I can get transcriptions, but the problem I am having is that transcriptions only come back after the stream has ended.

carstarai avatar May 03 '24 16:05 carstarai

I know the models offer mulaw encoding, but every time I set mulaw on a recognizer in the console it flips back to linear16. Is anyone able to help?

carstarai avatar May 06 '24 18:05 carstarai