
Support captioning of Lives during broadcast for hugely improved accessibility.

Open shibco opened this issue 4 years ago • 9 comments

Describe the problem to be solved

Peertube's Live system is a really powerful and useful feature. Peertube's subtitle support for stored videos (and the ability to add them after the fact) is also hugely important for accessibility. However, to truly make Peertube an accessible platform, the Live player should also support auto-captioned subtitles that can be turned on or off by the viewer.

Describe the solution you would like:

There is good live captioning support in tools such as OBS. Alongside human-supplied captioning, there's a very solid plugin that generates very fast auto / AI captioning and streams the captions alongside the video content. When captions are detected in a livestream, Peertube should offer to display them as it does for a non-live video. This would make Peertube significantly more accessible to individuals who require audio or visual assistance to fully participate as viewers of Peertube's lives.

shibco avatar Oct 28 '21 14:10 shibco

Possibility to use Whisper for this task?

EchedelleLR avatar Mar 19 '23 02:03 EchedelleLR

Possibility to use Whisper for this task?

I heard that transcribing 2 minutes of video needs about 1 minute on a desktop with a GPU. This depends very much on the performance of the machine.

Therefore, it is more practical to do this on the upload side.

iacore avatar Jul 20 '23 03:07 iacore

Therefore, it is more practical to do this on the upload side.

The tool linked by the issue author does not add captioning client-side; it uses the Google Cloud Speech Recognition API, which itself uses something like Whisper server-side. Performing ML tasks (like speech recognition) client-side is not practical at all at the moment.

nfbyte avatar Aug 08 '23 16:08 nfbyte

Possibility to use Whisper for this task?

This is not an issue asking for ML transcription. It is an issue asking for Peertube to support subtitles that are streamed alongside video content. Currently, the only way to do this is to 'bake' the subtitles into the livestream by compositing them as a source/layer in your streaming software. This issue describes a method that treats streamed subtitles in the same way as subtitle files on non-Live Peertube videos. Whether it uses Whisper or any other ML speech-to-text service is not relevant to the issue, because the ideal solution in this case is source-agnostic.

Edit to add: Whether this is done as part of the Peertube live or supplied by the livestreamer (via, say, the OBS plugin linked in my original issue) isn't as relevant right now as the fact that Peertube can't display subtitles on a livestream.
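
For illustration, a minimal sketch of the display side only (not PeerTube's actual player code), assuming an HTML5 video element and a hypothetical onCaptionFragment() callback fired whenever a caption arrives alongside the stream. It uses the browser's standard text-track mechanism, so viewers could toggle the captions exactly like subtitles on a stored video:

// Minimal sketch, display side only; how the caption text reaches the page is out of scope here.
const video = document.querySelector('video');
const track = video.addTextTrack('captions', 'Live captions', 'en');
track.mode = 'showing'; // viewers could toggle this from the player UI

// Hypothetical callback invoked for each caption fragment received with the stream.
function onCaptionFragment(text, startSeconds, endSeconds) {
  track.addCue(new VTTCue(startSeconds, endSeconds, text));
}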

shibco avatar Aug 08 '23 17:08 shibco

It is an issue asking for Peertube to support subtitles that are streamed alongside video content.

I think this would be a hack, as at the moment any client-side solution will likely rely on external (potentially proprietary / non-free) APIs anyway. The ideal source agnostic way to add captioning (for both videos and livestreams) is server-side in PeerTube, as I mention in #5931.

nfbyte avatar Aug 09 '23 08:08 nfbyte

What I wanted to highlight in my previous comment was that I do not think Peertube should be doing the ML transcription itself. I think we are broadly saying a similar thing here, just using different examples.

shibco avatar Aug 09 '23 09:08 shibco

Creating a real-time, client-side speech-to-text pipeline with translation for a live stream involves several steps and technologies. Below is a high-level overview of how you could achieve this using HTML, JavaScript, and relevant APIs:

  1. Set Up the Webpage: Create an HTML page that includes the necessary elements for capturing audio and displaying the transcribed and translated text.

  2. Capture Audio: Use the Web Speech API to capture audio from the user's microphone. The SpeechRecognition object can be used to start and stop capturing audio.

// The Web Speech API constructor is vendor-prefixed in Chromium-based browsers.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;      // keep listening for the duration of the stream
recognition.interimResults = false; // only report finalised results
recognition.start();
recognition.onresult = (event) => {
  // Read the result that has just been finalised, not always the first one.
  const transcript = event.results[event.resultIndex][0].transcript;
  // Handle the transcript (speech-to-text).
};
  3. Speech-to-Text: Extract the transcribed text from the captured audio using the Web Speech API. You can then display this text on your webpage.

  4. Translation: For translation, you can use a translation API like Google Cloud Translation, Microsoft Translator, or DeepL. You'll need to sign up for an API key and integrate it into your JavaScript code.

const translationApiKey = 'YOUR_TRANSLATION_API_KEY';
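// Note: calling a cloud translation API directly from the browser exposes this key
// to viewers; in practice the request would usually be proxied through a server.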
const sourceLanguage = 'en';  // Source language code (English)
const targetLanguage = 'fr';  // Target language code (French)

// Make a translation request to the API
async function translateText(text) {
  const response = await fetch(`https://translation.googleapis.com/language/translate/v2?key=${translationApiKey}&source=${sourceLanguage}&target=${targetLanguage}&q=${encodeURIComponent(text)}`);
  const data = await response.json();
  const translatedText = data.data.translations[0].translatedText;
  return translatedText;
}
  5. Real-time Update: Whenever new audio is transcribed, call the translation function and update the translated text on the webpage in real-time.
recognition.onresult = async (event) => {
  const transcript = event.results[event.resultIndex][0].transcript;
  const translatedText = await translateText(transcript);
  // Update the UI with the transcribed and translated text.
};
  6. Web Socket (Optional): For a smoother real-time experience, consider using WebSockets to stream the transcribed and translated text to the viewers of the live stream.
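
For illustration only, a minimal sketch of that optional WebSocket step, assuming a Node.js relay built on the 'ws' package; the broadcastCaption() helper and the message shape are hypothetical:

// Server-side relay (Node.js + the 'ws' package): broadcast each caption
// to every connected viewer of the live stream.
const { WebSocketServer } = require('ws');
const wss = new WebSocketServer({ port: 8080 });

function broadcastCaption(text, translatedText) {
  const message = JSON.stringify({ text, translatedText, at: Date.now() });
  for (const client of wss.clients) {
    if (client.readyState === 1) { // 1 === WebSocket.OPEN
      client.send(message);
    }
  }
}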

Consider the performance implications of real-time audio processing.

ROBERT-MCDOWELL avatar Aug 09 '23 11:08 ROBERT-MCDOWELL

Maybe there is a way to somehow reuse the code in this plugin for this? https://gitlab.com/apps_education/peertube/plugin-transcription

chagai95 avatar Aug 16 '23 18:08 chagai95

For comparison: Twitch requires captions to be streamed alongside the video. Twitch accepts captions in CEA-708/EIA-608 format (line 21, CC1, NTSC field 1).

From their docs:

Captions may be transmitted using one of the following methods:

  • CEA-708/EIA-608 embedded in the video elementary stream as described in ATSC A/72 (SEI user_data). This format is common among television broadcast encoders.
  • CEA-708/EIA-608 transmitted via RTMP onCaptionInfo script/AMF0 tag. This format is common among Internet broadcast encoders and media servers such as Elemental Technologies and Wowza.

When transmitting via RTMP, the payload must contain an ECMA Array with two element pairs:

  • A string named "type" containing the characters "708"
  • A string named "data" that contains a base64 encoded CEA-708/EIA-608 payload

https://help.twitch.tv/s/article/guide-to-closed-captions?language=en_US#HowtoUseLiveClosedCaptionsforBroadcasters

It would probably be a good idea to use those same standards in Peertube.
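
For illustration, a hedged sketch of how the onCaptionInfo payload described above could be assembled in JavaScript; the sendAmf0ScriptTag() helper is hypothetical and stands in for whatever an ingest or media server would use to write an AMF0 script tag into the RTMP stream:

// Wrap a CEA-708/EIA-608 caption packet as the ECMA Array Twitch describes:
// a "type" string containing "708" and a "data" string holding the base64 payload.
function buildOnCaptionInfo(cea708Bytes) {
  return {
    type: '708',
    data: Buffer.from(cea708Bytes).toString('base64'), // Node.js Buffer
  };
}

// Hypothetical helper call:
// sendAmf0ScriptTag('onCaptionInfo', buildOnCaptionInfo(captionPacket));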

candidexmedia avatar Sep 25 '24 13:09 candidexmedia