
Support for Mistral Audio Transcription API

nglmercer opened this issue 8 months ago • 4 comments

Clear and concise description of the problem

Hi!

I'm trying to use xsai for speech recognition with providers like Mistral, but I've noticed it lacks support for the audio transcription endpoint (/v1/audio/transcriptions). My audio data comes from the client side as a Blob or Buffer.
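Until native support lands, the endpoint can also be called with no extra dependency via fetch and FormData (a minimal sketch, assuming Node 18+ globals; the endpoint path and model name come from this issue, while the JSON response shape with a text field is assumed to follow the OpenAI-compatible convention):

```typescript
// Sketch: POST multipart audio to Mistral's transcription endpoint with plain fetch.
const MISTRAL_TRANSCRIBE_URL = 'https://api.mistral.ai/v1/audio/transcriptions';

function buildTranscriptionForm(audio: Blob, fileName: string, model: string): FormData {
  const form = new FormData();
  form.append('model', model);
  form.append('file', audio, fileName); // third argument sets the multipart filename
  return form;
}

async function transcribe(audio: Blob, apiKey: string): Promise<string> {
  const form = buildTranscriptionForm(audio, 'recording.webm', 'voxtral-mini-latest');
  const res = await fetch(MISTRAL_TRANSCRIBE_URL, {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` }, // fetch sets the multipart boundary itself
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
  const data = (await res.json()) as { text: string };
  return data.text;
}
```

This keeps the call shape close to the hypothetical provider.transcribe below, so a future native implementation could wrap something like it.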

Desired Usage with xsai

A native transcribe function would be ideal, allowing for a unified API call.

// `audioBlob` is captured from a browser microphone or other client-side source
const audioFile = new File([audioBlob], 'recording.webm');

// Hypothetical function call
const { text } = await provider.transcribe({
  model: 'voxtral-mini-latest',
  file: audioFile
});

Current Workaround

For now, I'm using the official @mistralai/mistralai SDK, which works well but requires an additional dependency.

import { Mistral } from '@mistralai/mistralai';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// `audioBuffer` is a Buffer derived from the client-side Blob
const response = await client.audio.transcriptions.complete({
  model: "voxtral-mini-latest",
  file: {
    fileName: "audio.webm",
    content: audioBuffer
  }
});

console.log("Transcription:", response.text);

Either add a utility that wraps transcription via the Mistral library, or have xsai implement support natively.

Suggested solution / Ideas

Use the official SDK via import { Mistral } from '@mistralai/mistralai';

Alternative

An external app exposing a REST API for this? For example:

import { Hono } from 'hono';
import { Mistral } from '@mistralai/mistralai';
//import prompts from '../prompts/transcript.js';
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });
const router = new Hono();

// --- Function to encode to WAV ---
function encodeWAV(samples: Float32Array, sampleRate: number = 10000) {
  const numChannels = 1;
  const bitDepth = 32;

  const dataBuffer = Buffer.alloc(samples.length * 4);
  for (let i = 0; i < samples.length; i++) {
    dataBuffer.writeFloatLE(samples[i], i * 4);
  }

  const header = Buffer.alloc(44);
  const dataSize = dataBuffer.length;
  const fileSize = dataSize + 36;

  header.write('RIFF', 0);
  header.writeUInt32LE(fileSize, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(3, 20);               // IEEE float format
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  const byteRate = sampleRate * numChannels * (bitDepth / 8);
  header.writeUInt32LE(byteRate, 28);
  const blockAlign = numChannels * (bitDepth / 8);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write('data', 36);
  header.writeUInt32LE(dataSize, 40);

  return Buffer.concat([header, dataBuffer]);
}

async function generateTranscript(audioBytes: Buffer): Promise<string> {
  //const audioBase64 = audioBytes.toString("base64");

  const transcriptions = await client.audio.transcriptions.complete({
    model: "voxtral-mini-latest",
    file: {
      fileName: "audio.mp3",
      content: audioBytes
    }
  });
  console.log("Answer:", transcriptions);
  return transcriptions.text || "No transcript returned";
}

// ✅ POST /base64 (unchanged)
router.post('/base64', async (c) => {
  try {
    const { audio } = await c.req.json();
    if (!audio || typeof audio !== 'string') return c.text('Invalid base64', 400);
    const buffer = Buffer.from(audio, 'base64');
    const transcript = await generateTranscript(buffer);
    return c.json({ transcript });
  } catch (e) {
    return c.text(`Error: ${searchError(e)}`, 500);
  }
});

// ✅ POST /float32array (MODIFIED)
router.post('/float32array', async (c) => {
  try {
    const { audio, sampleRate } = await c.req.json();
    if (!Array.isArray(audio)) {
      return c.text('Invalid audio data: must be an array', 400);
    }
    const float32Array = new Float32Array(audio);
    const wavBuffer = encodeWAV(float32Array, sampleRate);

    const transcript = await generateTranscript(wavBuffer);
    console.log("transcript", transcript);
    return c.json({ transcript: transcript });
  } catch (e) {
    const errorMessage = searchError(e);
    console.error('Error in /float32array:', errorMessage, e);
    return c.text(`Error: ${errorMessage}`, 500);
  }
});

// ✅ POST /buffer (unchanged – assumes the buffer is already in the correct format, e.g. a .wav file)
router.post('/buffer', async (c) => {
  try {
    const arrayBuffer = await c.req.arrayBuffer();
    const buffer = Buffer.from(arrayBuffer);
    if (!buffer || buffer.length === 0) return c.text('Empty buffer', 400);

    const transcript = await generateTranscript(buffer);
    return c.json({ transcript });
  } catch (e) {
    return c.text(`Error: ${searchError(e)}`, 500);
  }
});

function searchError(c: Error | unknown) {
  if (typeof c === 'string') {
    return c;
  } else if (c instanceof Error) {
    console.error("Error object:", c);
    return c.message;
  } else {
    console.error("Unknown error type:", c);
    return 'Unknown error';
  }
}

export default router;
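As a quick sanity check of the header math in encodeWAV above, the RIFF/WAVE magic bytes and size fields can be read back from the result. The function is repeated here in condensed form so the snippet runs standalone, with an assumed, more typical default sample rate of 48000 instead of 10000:

```typescript
// Condensed copy of encodeWAV above: mono, 32-bit IEEE-float WAV.
function encodeWAV(samples: Float32Array, sampleRate = 48000): Buffer {
  const dataBuffer = Buffer.alloc(samples.length * 4);
  for (let i = 0; i < samples.length; i++) dataBuffer.writeFloatLE(samples[i], i * 4);

  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(dataBuffer.length + 36, 4); // file size minus the 8-byte RIFF preamble
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(3, 20);              // format 3 = IEEE float
  header.writeUInt16LE(1, 22);              // mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 4, 28); // byte rate = rate * channels * bytes per sample
  header.writeUInt16LE(4, 32);              // block align
  header.writeUInt16LE(32, 34);             // bit depth
  header.write('data', 36);
  header.writeUInt32LE(dataBuffer.length, 40);
  return Buffer.concat([header, dataBuffer]);
}

// One second of silence: 48000 samples * 4 bytes = 192000 bytes of data,
// plus the fixed 44-byte header.
const wav = encodeWAV(new Float32Array(48000), 48000);
console.log(wav.toString('ascii', 0, 4));  // RIFF
console.log(wav.readUInt32LE(40));         // 192000
```

Passing the actual capture rate from the client matters here: a wrong sampleRate in the header will make the transcription model receive audio played back at the wrong speed.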

Additional context

Voxtral Small and Mini can answer questions directly from speech, or from an audio clip combined with a text prompt. https://mistral.ai/news/voxtral

Validations

  • [x] Follow our Code of Conduct
  • [x] Read the Contributing Guide.
  • [x] Check that there isn't already an issue requesting the same feature, to avoid creating a duplicate.

nglmercer avatar Aug 05 '25 03:08 nglmercer

As an alternative, we could perhaps let unSpeech to support Mistral Audio Transcription API. cc @nekomeowww

kwaa avatar Aug 07 '25 06:08 kwaa

As an alternative, we could perhaps let unSpeech to support Mistral Audio Transcription API. cc @nekomeowww

That's fine, but for local or third-party models it would be better to have something more permissive.

nglmercer avatar Aug 07 '25 06:08 nglmercer

Will add the support today.

nekomeowww avatar Aug 07 '25 06:08 nekomeowww

Mistral does support the standard OpenAI-compatible audio transcription endpoint: https://docs.mistral.ai/api/#tag/ocr/operation/ocr_v1_ocr_post

Should be fixed with #415

skirkru avatar Aug 27 '25 02:08 skirkru

Since #415 has been merged, I have closed this issue.

kwaa avatar Nov 13 '25 06:11 kwaa