Feature Suggestion: Add AI subtitles

Open janwilmake opened this issue 11 months ago • 3 comments

Having subtitles would be great for language learning...

According to the documentation, you can receive real-time transcripts of the audio through the response.audio_transcript.delta server events. These arrive while the audio stream itself is still being received.

For WebRTC connections, the documentation mentions that during a session you'll receive:

  • input_audio_buffer.speech_started events when input starts
  • input_audio_buffer.speech_stopped events when input stops
  • response.audio_transcript.delta events for the in-progress audio transcript
  • response.done event when the model has completed transcribing and sending a response

This means you can get word-by-word transcription updates as the audio is being processed, allowing you to build features like real-time captions or text displays alongside the voice interaction.

The transcription events are part of the standard event lifecycle whether you're using WebRTC or WebSocket connections, so you'll have access to the transcript regardless of which connection method you choose.
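For a WebRTC session, those server events arrive as JSON messages on the data channel opened alongside the peer connection (the OpenAI examples name it "oai-events"). Here is a minimal sketch of collecting the transcript that way, assuming pc is the RTCPeerConnection for the Realtime session and renderSubtitles is a hypothetical UI callback:

// Sketch: read Realtime API server events from the WebRTC data channel.
function attachTranscriptListener(
  pc: RTCPeerConnection,
  renderSubtitles: (text: string) => void
) {
  const dc = pc.createDataChannel('oai-events');
  let transcript = '';

  dc.addEventListener('message', (event: MessageEvent) => {
    const serverEvent = JSON.parse(event.data);

    if (serverEvent.type === 'response.audio_transcript.delta') {
      // Incremental text for the in-progress audio response
      transcript += serverEvent.delta;
      renderSubtitles(transcript);
    } else if (serverEvent.type === 'response.done') {
      // Response finished; clear the buffer for the next turn
      transcript = '';
    }
  });
}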

We can probably create a component like this:

import React, { useEffect, useState } from 'react';
import { useRoomContext } from '~/hooks/useRoomContext';
import type { User } from '~/types/Messages';

const AiSubtitles = () => {
  const [subtitles, setSubtitles] = useState('');
  const [isVisible, setIsVisible] = useState(false);
  const { room } = useRoomContext();

  // Show or hide the subtitle overlay based on the AI user's speaking state
  const recordActivity = (user: User) => {
    if (user.id === 'ai' && user.speaking) {
      setIsVisible(true);
      // Here we'd need the actual transcript from the AI service
      // (e.g. accumulated response.audio_transcript.delta events).
      // For now, we just show a speaking indicator.
      setSubtitles('AI is speaking...');
    } else {
      setIsVisible(false);
      setSubtitles('');
    }
  };

  useEffect(() => {
    // Assumes room.otherUsers lists the remote participants, including the AI user
    const aiUser = room.otherUsers.find((user) => user.id === 'ai');
    if (aiUser) {
      recordActivity(aiUser);
    } else {
      setIsVisible(false);
      setSubtitles('');
    }
  }, [room.otherUsers]);

  if (!isVisible) return null;

  return (
    <div className="fixed bottom-24 left-1/2 -translate-x-1/2 w-full max-w-2xl mx-auto px-4">
      <div className="bg-black/75 text-white p-4 rounded-lg text-center text-lg animate-fadeIn">
        {subtitles}
      </div>
    </div>
  );
};

export default AiSubtitles;

to support subtitles, render it in /app/routes/_room.$roomName.room.tsx, and feed it the transcript by processing the Realtime API response events.
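On the React side, one way to feed those deltas into state could look like the hook below. This is just a sketch, assuming the client has access to the RTCPeerConnection carrying the OpenAI session; if that connection lives server-side, the deltas would need to be relayed to the client (e.g. over the room's websocket) instead.

import { useEffect, useState } from 'react';

// Hypothetical hook: accumulates response.audio_transcript.delta events
// into a string that the room route could pass to AiSubtitles.
export function useAiTranscript(pc: RTCPeerConnection | null) {
  const [transcript, setTranscript] = useState('');

  useEffect(() => {
    if (!pc) return;
    const dc = pc.createDataChannel('oai-events');

    const onMessage = (event: MessageEvent) => {
      const serverEvent = JSON.parse(event.data);
      if (serverEvent.type === 'response.audio_transcript.delta') {
        setTranscript((prev) => prev + serverEvent.delta);
      } else if (serverEvent.type === 'response.done') {
        setTranscript('');
      }
    };

    dc.addEventListener('message', onMessage);
    return () => {
      dc.removeEventListener('message', onMessage);
      dc.close();
    };
  }, [pc]);

  return transcript;
}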

janwilmake avatar Dec 20 '24 17:12 janwilmake

Great idea!

One blocking issue right now, though, is that OpenAI only supports receiving a single audio stream. That is why our demo app currently requires a "push-to-talk" button to talk to the AI: it ensures that only the person pressing the button is forwarded to (and heard by) the AI.

So right now we could only get subtitles for the one person currently talking to the AI, which sounds a little limited to me and not quite what you are suggesting, no?

But yes, as soon as OpenAI adds support for receiving multiple audio streams from all the participants in the meeting, this would become a cool and useful feature.

nils-ohlmeier avatar Dec 20 '24 18:12 nils-ohlmeier

IMO it's already great just to see subtitles for what the AI says back.

Especially useful if it speaks in a language you're not super familiar with.

However, I understand your point. Having subtitles for everybody would be a killer feature! The only solution I can imagine would be to add a separate AI for each speaker and have each one silently listen to its paired speaker without responding.
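For what it's worth, a "silently listening" session might already be expressible through the Realtime API's session config: input audio transcription can be enabled, and (if I read the docs correctly) automatic responses can be suppressed via the turn-detection settings. A rough, unverified sketch, assuming dc is that session's data channel:

// Rough sketch (field names as documented for the Realtime API; worth verifying):
// transcribe incoming audio, but do not auto-generate spoken replies.
dc.send(JSON.stringify({
  type: 'session.update',
  session: {
    input_audio_transcription: { model: 'whisper-1' },
    turn_detection: { type: 'server_vad', create_response: false },
  },
}));
// The speaker's transcript would then arrive via
// conversation.item.input_audio_transcription.completed events.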

janwilmake avatar Dec 20 '24 18:12 janwilmake

Hi! Quick question—do you know how to access the response.audio_transcript.delta events that @janwilmake mentioned from the OpenAI documentation when using WebRTC?

I couldn’t find a detailed example for WebRTC setups specifically—most references I’ve seen relate to WebSocket streams. If you’ve managed to capture those transcript events in a WebRTC context (like during a real-time voice session), I’d love to know how to hook into them.

Any pointers would be super helpful!

josegmez avatar Apr 18 '25 06:04 josegmez