**Support for Mistral Audio Transcription API**
Clear and concise description of the problem
Hi!
I'm trying to use xsai for speech recognition with providers like Mistral, but I've noticed it lacks support for the audio transcription endpoint (`/v1/audio/transcriptions`). My audio data comes from the client side as a Blob or Buffer.
Desired Usage with xsai
A native transcribe function would be ideal, allowing for a unified API call.
```ts
// `audioBlob` is captured from a browser microphone or other client-side source
const audioFile = new File([audioBlob], 'recording.webm');

// Hypothetical function call
const { text } = await provider.transcribe({
  model: 'voxtral-mini-latest',
  file: audioFile
});
```
Current Workaround
For now, I'm using the official @mistralai/mistralai SDK, which works well but requires an additional dependency.
```ts
import { Mistral } from '@mistralai/mistralai';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// `audioBuffer` is a Buffer derived from the client-side Blob
const response = await client.audio.transcriptions.complete({
  model: "voxtral-mini-latest",
  file: {
    fileName: "audio.webm",
    content: audioBuffer
  }
});

console.log("Transcription:", response.text);
```
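The "Buffer derived from the client-side Blob" step in the comment above can be done with `Blob#arrayBuffer()`. A minimal sketch, assuming Node 18+ where `Blob` is a global:

```typescript
// Sketch: convert a client-side Blob into the Node Buffer the SDK expects.
async function blobToBuffer(blob: Blob): Promise<Buffer> {
  // Blob#arrayBuffer() resolves with the raw bytes; Buffer.from wraps them without copying semantics surprises.
  return Buffer.from(await blob.arrayBuffer());
}
```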
Either add a utility that uses the Mistral library for transcription, or have xsai implement support natively.
Suggested solution / Ideas
Build on the official SDK, using `import { Mistral } from '@mistralai/mistralai';`.
Alternative
Alternatively, an external app could expose transcription over a REST API. For example:
```ts
import { Hono } from 'hono';
import { Mistral } from '@mistralai/mistralai';
//import prompts from '../prompts/transcript.js';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });
const router = new Hono();

// --- Function to encode to WAV ---
function encodeWAV(samples: Float32Array, sampleRate: number = 10000) {
  const numChannels = 1;
  const bitDepth = 32;
  const dataBuffer = Buffer.alloc(samples.length * 4);
  for (let i = 0; i < samples.length; i++) {
    dataBuffer.writeFloatLE(samples[i], i * 4);
  }
  const header = Buffer.alloc(44);
  const dataSize = dataBuffer.length;
  const fileSize = dataSize + 36;
  header.write('RIFF', 0);
  header.writeUInt32LE(fileSize, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(3, 20); // IEEE float format
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  const byteRate = sampleRate * numChannels * (bitDepth / 8);
  header.writeUInt32LE(byteRate, 28);
  const blockAlign = numChannels * (bitDepth / 8);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write('data', 36);
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, dataBuffer]);
}

async function generateTranscript(audioBytes: Buffer): Promise<string> {
  //const audioBase64 = audioBytes.toString("base64");
  const transcriptions = await client.audio.transcriptions.complete({
    model: "voxtral-mini-latest",
    file: {
      fileName: "audio.mp3",
      content: audioBytes
    }
  });
  console.log("Answer:", transcriptions);
  return transcriptions.text || "No transcript returned";
}

// ✅ POST /base64 (unchanged)
router.post('/base64', async (c) => {
  try {
    const { audio } = await c.req.json();
    if (!audio || typeof audio !== 'string') return c.text('Invalid base64', 400);
    const buffer = Buffer.from(audio, 'base64');
    const transcript = await generateTranscript(buffer);
    return c.json({ transcript });
  } catch (e) {
    return c.text(`Error: ${searchError(e)}`, 500);
  }
});

// ✅ POST /float32array (MODIFIED)
router.post('/float32array', async (c) => {
  try {
    const { audio, sampleRate } = await c.req.json();
    if (!Array.isArray(audio)) {
      return c.text('Invalid audio data: must be an array', 400);
    }
    const float32Array = new Float32Array(audio);
    const wavBuffer = encodeWAV(float32Array, sampleRate);
    const transcript = await generateTranscript(wavBuffer);
    console.log("transcript", transcript);
    return c.json({ transcript: transcript });
  } catch (e) {
    const errorMessage = searchError(e);
    console.error('Error in /float32array:', errorMessage, e);
    return c.text(`Error: ${errorMessage}`, 500);
  }
});

// ✅ POST /buffer (unchanged – assumes the buffer is already in the correct format, e.g. a .wav file)
router.post('/buffer', async (c) => {
  try {
    const arrayBuffer = await c.req.arrayBuffer();
    const buffer = Buffer.from(arrayBuffer);
    if (!buffer || buffer.length === 0) return c.text('Empty buffer', 400);
    const transcript = await generateTranscript(buffer);
    return c.json({ transcript });
  } catch (e) {
    return c.text(`Error: ${searchError(e)}`, 500);
  }
});

function searchError(c: Error | unknown) {
  if (typeof c === 'string') {
    return c;
  } else if (c instanceof Error) {
    console.error("Error object:", c);
    return c.message;
  } else {
    console.error("Unknown error type:", c);
    return 'Unknown error';
  }
}

export default router;
```
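One way to sanity-check what `encodeWAV` produces is to parse the 44-byte RIFF/WAVE header back and verify each field. The `parseWavHeader` helper below is hypothetical, not part of the service above; it reads the same offsets the encoder writes:

```typescript
// Sketch: parse a 44-byte RIFF/WAVE header to verify the fields an encoder wrote.
interface WavHeader {
  riff: string;        // should be 'RIFF'
  wave: string;        // should be 'WAVE'
  audioFormat: number; // 3 = IEEE float, 1 = PCM
  numChannels: number;
  sampleRate: number;
  bitDepth: number;
  dataSize: number;    // byte length of the sample data that follows
}

function parseWavHeader(buf: Buffer): WavHeader {
  return {
    riff: buf.toString('ascii', 0, 4),
    wave: buf.toString('ascii', 8, 12),
    audioFormat: buf.readUInt16LE(20),
    numChannels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitDepth: buf.readUInt16LE(34),
    dataSize: buf.readUInt32LE(40),
  };
}
```

Note also that the `sampleRate` default of 10000 in `encodeWAV` is unusual; speech models typically expect 16000 Hz, so it's worth confirming the client sends an explicit rate.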
Additional context
Voxtral Small and Mini are capable of answering questions directly from speech, or from an audio file combined with a text-based prompt. https://mistral.ai/news/voxtral
Validations
- [x] Follow our Code of Conduct
- [x] Read the Contributing Guide.
- [x] Check that there isn't already an issue that requests the same feature to avoid creating a duplicate.
As an alternative, we could perhaps let unSpeech to support Mistral Audio Transcription API. cc @nekomeowww
That's fine, but for local or third-party models it would be better to have something more permissive.
Will add the support today.
Mistral does support the standard OpenAI-compatible audio transcription endpoint: https://docs.mistral.ai/api/#tag/ocr/operation/ocr_v1_ocr_post
Should be fixed with #415
Since #415 has been merged, I have closed this issue.