Use transcription service for transcription/translation of audio/video files
Paired on this with @philmcmahon
What does this change?
This PR integrates the transcription service into Giant.
- Adds ExternalTranscriptionExtractor, which sends a transcription message to the transcription service task (input) queue
- Adds ExternalWorkerScheduler, which runs at intervals to check whether there are any transcription output messages
- Adds ExternalTranscriptionWorker, which retrieves messages from the Giant transcription output queue
  - Success message:
    - updates Elasticsearch with the resulting transcript so that it is searchable
    - updates the neo4j relationship between the blob and the extractor to processed
    - deletes the message
  - Failure message (retried 3 times; if all attempts fail):
    - updates the neo4j relationship between the blob and the extractor to failure
    - doesn't delete the message, because it will be moved to the dead letter queue
- Creates a signed download URL (for downloading the audio/video file) before sending the message to the transcription service task queue
- Creates signed upload URLs (for uploading the transcript output) before sending the message to the transcription service task queue
- Adds a new relationship between blob and extractor, `PROCESSING_EXTERNALLY`, which applies from when the message is sent to the external transcription service until the transcript output is ready and the output message is delivered to the output queue
- Handles translation if the audio/video is not in English
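
The output-message handling described above can be sketched roughly as follows. This is an illustrative Python model, not the code in this PR: `OutputMessage`, `FakeStores`, `handle_output_message` and `MAX_ATTEMPTS` are hypothetical names, the stores are in-memory stand-ins for Elasticsearch, neo4j and the output queue, and it assumes "retries 3 times" means the message is given up on once it has been received three times.

```python
from dataclasses import dataclass, field
from enum import Enum

# Assumption: "retries 3 times" is modelled as three deliveries in total.
MAX_ATTEMPTS = 3


class ExtractorState(Enum):
    PROCESSING_EXTERNALLY = "PROCESSING_EXTERNALLY"
    PROCESSED = "PROCESSED"
    FAILURE = "FAILURE"


@dataclass
class OutputMessage:
    """Hypothetical shape of a message on the transcription output queue."""
    blob_id: str
    success: bool
    transcript: str = ""
    receive_count: int = 1  # how many times the queue has delivered this message


@dataclass
class FakeStores:
    """In-memory stand-ins for Elasticsearch, neo4j and the output queue."""
    transcripts: dict = field(default_factory=dict)  # blob id -> transcript
    states: dict = field(default_factory=dict)       # blob id -> ExtractorState
    deleted: list = field(default_factory=list)      # blob ids of deleted messages


def handle_output_message(msg: OutputMessage, stores: FakeStores) -> None:
    if msg.success:
        # Success: index the transcript, mark the blob processed, delete the message.
        stores.transcripts[msg.blob_id] = msg.transcript
        stores.states[msg.blob_id] = ExtractorState.PROCESSED
        stores.deleted.append(msg.blob_id)
        return
    if msg.receive_count >= MAX_ATTEMPTS:
        # Every attempt failed: record the failure against the blob.
        stores.states[msg.blob_id] = ExtractorState.FAILURE
    # Failure messages are never deleted here; the queue is expected to move
    # them to the dead letter queue once the attempts are exhausted.
```

Leaving failed messages undeleted keeps the retry logic in the queue configuration rather than in the worker, so the worker only has to decide when to record the failure.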
The following SSM parameters were created for playground but should also be created for pfi-giant (prod):
- /pfi/pfi-playground/rex/transcribe/transcriptionServiceQueueUrl
- /pfi/pfi-playground/rex/transcribe/transcriptionOutputQueueUrl
- /pfi/pfi-playground/rex/transcribe/transcriptionOutputDeadLetterQueueUrl
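
As a quick way to list the parameters that need creating per stack, the paths can be generated from the stack name. This is only a sketch: `transcribe_parameter_paths` is a hypothetical helper, and it assumes the prod parameters follow the same path pattern as playground with the stack segment swapped for `pfi-giant`, which should be confirmed before creating them.

```python
# The three parameter names listed for playground.
PARAMETER_NAMES = (
    "transcriptionServiceQueueUrl",
    "transcriptionOutputQueueUrl",
    "transcriptionOutputDeadLetterQueueUrl",
)


def transcribe_parameter_paths(stack: str) -> list:
    """Build the SSM parameter paths for a stack.

    Assumption: every stack uses /pfi/<stack>/rex/transcribe/<name>.
    """
    return [f"/pfi/{stack}/rex/transcribe/{name}" for name in PARAMETER_NAMES]


if __name__ == "__main__":
    for path in transcribe_parameter_paths("pfi-giant"):
        print(path)
```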
TODO in upcoming PR
- zipping & unzipping the transcript files rather than handling the 3 file formats separately
How to test
Tested locally and in code
The relevant PRs for this change, in the order they need to be released, are as follows:
1. https://github.com/guardian/investigations-platform/pull/521
2. https://github.com/guardian/transcription-service/pull/103
3. Current PR
This is looking great - just a few minor comments above