generate-video-subtitle
Using the Google Cloud Speech-To-Text API
Generate subtitles for a video
[TOC]
Requirement
- Add a subtitle to a video automatically
  - Extract sound from the video
    - Input: video
      - Format: any video format allowed by FFmpeg
      - Size: no longer than 1 min; if it is longer than 1 min, we may upload the output audio to Google Cloud Storage and use `gcs_uri` in the speech-conversion step with the Google Cloud Speech-To-Text API
    - Output: audio
      - Format:
        - Encoding: FLAC
        - sample_rate_hertz=16000 or more
        - Language: Mandarin
  - Convert speech into text
    - Input: audio
      - Format: the output audio format above
    - Output: text
      - Format: plain text and an `.srt` file
  - Generate the subtitle from the above sources
    - The `.srt` file is the target subtitle file
Component
1. Extract sound from video using FFmpeg
- Install:

  ```
  $ brew install ffmpeg
  ```

- Usage:

  ```
  ffmpeg -i video.mp4 -f mp3 -ab 192000 -vn audio.mp3
  ```

  - `-i`: input file
  - `-f`: convert to this format; FLAC is recommended by the Google Cloud Speech-To-Text API
  - `-ar`: convert to this sampleRateHertz; 16000 is recommended by the Google Cloud Speech-To-Text API
  - `-vn`: the output file is not a video
  - `-ac 1`: only 1-channel audio is allowed by Cloud Speech

  (A Python sketch that ties these flags together appears at the end of this component.)
- Allowed formats:

  ```
  $ ffmpeg -formats
   D  3dostr          3DO STR
    E 3g2             3GP2 (3GPP2 file format)
    E 3gp             3GP (3GPP file format)
   D  4xm             4X Technologies
    E a64             a64 - video for Commodore 64
   D  aa              Audible AA format files
   D  aac             raw ADTS AAC (Advanced Audio Coding)
   DE ac3             raw AC-3
   D  acm             Interplay ACM
   ...
  ```
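As a minimal sketch of this extraction step, the snippet below invokes ffmpeg from Python with the flags recommended above (FLAC, 16000 Hz, mono, no video stream). The function name and error handling are illustrative assumptions, not the project's extract-audio.py itself.

```python
import subprocess

def extract_audio(video_path, audio_path):
    """Extract mono 16 kHz FLAC audio from a video with ffmpeg.

    Flags follow the recommendations above; the function name and
    error handling are assumptions for illustration.
    """
    cmd = [
        'ffmpeg',
        '-i', video_path,   # input video (any format ffmpeg supports)
        '-f', 'flac',       # FLAC encoding, recommended by the Speech-To-Text API
        '-ar', '16000',     # 16000 Hz sample rate
        '-ac', '1',         # single audio channel, required by Cloud Speech
        '-vn',              # drop the video stream
        audio_path,
    ]
    # Passing argv as a list (not a shell string) keeps whitespace in
    # file names safe without extra quoting.
    subprocess.run(cmd, check=True)
```

For example, `extract_audio('Savvy _June Cut_final.mp4', 'audio-Savvy _June Cut_final.flac')` would produce the FLAC file used in the next component.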
2. Convert audio into text using Google Cloud Speech-To-Text API
- Install: follow the official website to set up a GCP Console project, set the environment variable GOOGLE_APPLICATION_CREDENTIALS, and install and initialize the Google Cloud SDK.

  Install the client library (for Python):

  ```
  pip3 install --upgrade google-cloud-speech
  ```
- In this project, our test video is longer than 1 min, so we need asynchronous transcription and a `gcs_uri` for the speech file.

- We refer to the following code from the official website:
```python
# [START speech_transcribe_async_gcs]
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))
# [END speech_transcribe_async_gcs]
```
- In `config`, we use the following:

  - `sample_rate_hertz=16000`
  - `language_code='zh'`
  - `encoding='FLAC'`
  - `speech_contexts`: this sets several phrase hints for the specific video "Savvy _June Cut_final.mp4"
  - `enable_word_time_offsets=True`, which includes the timestamps used to generate the subtitle
  - `enable_automatic_punctuation=True`, which includes punctuation in the transcript field
```python
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code='zh',
    speech_contexts=[
        speech.types.SpeechContext(phrases=[
            '思睿', '在思睿', '海外教育', '双师', '辅导',
            '授课', '云台录播', '讲义', '赢取'
        ])
    ],
    enable_word_time_offsets=True,
    enable_automatic_punctuation=True)
```
Usage

extract-audio.py

```
python3 extract-audio.py Savvy\ _June\ Cut_final.mp4
```
- This script checks whether the file exists and whether its format is allowed by ffmpeg.

- Furthermore, this script can handle input names containing whitespace, and the output name contains the original file name.

- The output audio satisfies every config setting needed by the Google Cloud Speech-To-Text API (see Component 2).

- The output file name would be `audio-inputFileName.flac`, which is also uploaded to `gs://test-convert-audio/audio-inputFileName.flac` (used in the convert step). A sketch of the naming and upload step follows this list.
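A minimal sketch of the naming and upload behaviour described above, assuming the FLAC file has already been produced by the extraction step. The function name and the use of the google-cloud-storage client are illustrative assumptions rather than the actual extract-audio.py code.

```python
import os
import sys
from google.cloud import storage

def upload_audio(video_path, bucket_name='test-convert-audio'):
    """Derive audio-<inputFileName>.flac from the video name and upload it.

    Assumes the FLAC file was already extracted locally with that name;
    names and client usage are illustrative, not the project's code.
    """
    if not os.path.isfile(video_path):
        sys.exit('File not found: {}'.format(video_path))

    # Keep the original file name (including whitespace) in the output name.
    base = os.path.splitext(os.path.basename(video_path))[0]
    audio_name = 'audio-{}.flac'.format(base)

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(audio_name).upload_from_filename(audio_name)
    return 'gs://{}/{}'.format(bucket_name, audio_name)
```

The returned `gs://` URI matches the one passed to audio-to-text.py in the convert step.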
audio-to-text.py

For shorter audio (no longer than 1 min), synchronous speech recognition is used:

```
python3 audio-to-text.py localFile.flac
```

For longer audio (longer than 1 min), asynchronous speech recognition is used:

```
python3 audio-to-text.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
```

Note: if the filename contains whitespace, please wrap it in "".
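As a sketch of how this synchronous/asynchronous choice could be made, the snippet below branches on whether the argument is a local path or a `gs://` URI. The prefix check and function name are assumptions; the client calls follow the older google-cloud-speech style used elsewhere in this README.

```python
import io
from google.cloud import speech
from google.cloud.speech import types

def recognize(path, config):
    """Run synchronous recognition for local files and asynchronous
    recognition for Cloud Storage URIs, as described above.
    The gs:// prefix check is an assumption about how the script
    distinguishes the two cases."""
    client = speech.SpeechClient()
    if path.startswith('gs://'):
        # Longer audio: asynchronous recognition against a Cloud Storage URI.
        audio = types.RecognitionAudio(uri=path)
        operation = client.long_running_recognize(config, audio)
        return operation.result(timeout=90)
    # Shorter audio (no longer than 1 min): synchronous recognition on local content.
    with io.open(path, 'rb') as f:
        audio = types.RecognitionAudio(content=f.read())
    return client.recognize(config, audio)
```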
Step 1: convert to text

This script sends a recognize request to Cloud Speech-to-Text and obtains the response; we also write the transcript into a plain-text file in the output folder (transcript-text.txt).

```python
operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=90)
```
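A minimal sketch of writing the response out as the plain-text transcript, assuming the output folder and the transcript-text.txt name stated above; the helper function itself is an assumption.

```python
import os

def write_transcript(response, out_dir='output'):
    """Write the transcript from the Cloud Speech-to-Text response to
    output/transcript-text.txt, one result per line (folder and file name
    per the README; this helper is an illustrative assumption)."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, 'transcript-text.txt')
    with open(out_path, 'w', encoding='utf-8') as f:
        for result in response.results:
            # The first alternative is the most likely transcription.
            f.write(result.alternatives[0].transcript + '\n')
```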
Step 2: format the text as a subtitle file (.srt)

This script adds a sequence number for each line, a timestamp for each line (e.g. 0:00:1.012 --> 0:00:3.211), and the words, which is the format required by an .srt file.

The helper module timestr.py converts start_time or end_time into an allowed string; both start_time and end_time come from the response information.

Because `result.alternatives[0].words` only contains word information, the script reads the output file transcript-text.txt, which includes punctuation, and adds the punctuation into the subtitle file or leaves whitespace at the punctuation positions.
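The project's timestr.py and its exact grouping and punctuation logic are not shown in this README, so the following is only a hedged sketch of turning word time offsets into numbered .srt entries. The timestamp conversion, the ten-words-per-line grouping, and the output path are assumptions.

```python
def to_timestr(d):
    """Convert a protobuf Duration (seconds + nanos) into a timestamp string
    in the style used above, e.g. 0:00:1.012. This stands in for timestr.py,
    whose exact API is not shown here."""
    total = d.seconds + d.nanos / 1e9
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return '{}:{:02d}:{:.3f}'.format(int(hours), int(minutes), seconds)

def build_srt(response, out_path='output/subtitle.srt', words_per_line=10):
    """Group recognized words into lines and write numbered subtitle entries.

    The grouping rule and file path are assumptions; punctuation handling
    (merging transcript-text.txt back in) is omitted in this sketch.
    """
    words = [w for result in response.results
             for w in result.alternatives[0].words]
    with open(out_path, 'w', encoding='utf-8') as f:
        for i in range(0, len(words), words_per_line):
            chunk = words[i:i + words_per_line]
            f.write('{}\n'.format(i // words_per_line + 1))  # sequence number
            f.write('{} --> {}\n'.format(to_timestr(chunk[0].start_time),
                                         to_timestr(chunk[-1].end_time)))
            f.write(''.join(w.word for w in chunk) + '\n\n')  # Mandarin text: no spaces
```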
Note
- There are two versions of audio-to-text.py available:

  - Subtitle with punctuation: audio-to-text-with-punctuation.py

    ```
    python3 audio-to-text-with-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
    ```

  - Subtitle without punctuation: audio-to-text-no-punctuation.py

    ```
    python3 audio-to-text-no-punctuation.py "gs://test-convert-audio/audio-Savvy _June Cut_final.flac"
    ```