Use transcription service for transcription/translation of audio/video files

Open marjisound opened this issue 1 year ago • 1 comments

Paired on this with @philmcmahon

What does this change?

This PR integrates the transcription service into Giant.

  • Adds ExternalTranscriptionExtractor, which sends a transcription message to the transcription service task (input) queue
  • Adds ExternalWorkerScheduler, which runs at intervals to check whether there are any transcription output messages
  • Adds ExternalTranscriptionWorker, which retrieves messages from the Giant transcription output queue
    • Success message:
      • updates Elasticsearch with the resulting transcript
      • updates the neo4j relationship between the blob and the extractor to processed
      • deletes the message
    • Failure message (retried 3 times; if all attempts fail):
      • updates the neo4j relationship between the blob and the extractor to failure
      • doesn't delete the message, because it will be moved to the dead letter queue
  • Creates a signed download URL (for downloading the audio/video file) before sending the message to the transcription service task queue
  • Creates signed upload URLs (for uploading the transcript output) before sending the message to the transcription service task queue
  • Adds a new relationship between blob and extractor, PROCESSING_EXTERNALLY, which applies from when the message is sent to the external transcription service until the transcript output is ready and the output message has been delivered to the output queue
  • Handles translation if the audio/video is not in English
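
For reviewers, the output-message handling described above can be sketched roughly as follows. This is only an illustration in Python, not the code in this PR: `OutputMessage`, `Stores`, `handle_output_message`, the `"SUCCESS"`/`"FAILURE"` status strings and the `receive_count` field are all hypothetical names, and the in-memory `Stores` stands in for Elasticsearch, neo4j and the output queue. It also assumes the "3 retries" correspond to how many times the message has been received.

```python
from dataclasses import dataclass, field
from enum import Enum

MAX_RECEIVES = 3  # the description says failures are retried 3 times


class ExtractorState(Enum):
    PROCESSING_EXTERNALLY = "PROCESSING_EXTERNALLY"
    PROCESSED = "PROCESSED"
    FAILURE = "FAILURE"


@dataclass
class OutputMessage:
    # Hypothetical shape of a message on the transcription output queue.
    blob_id: str
    status: str  # "SUCCESS" or "FAILURE" (assumed values)
    transcript: str = ""
    receive_count: int = 1  # assumed: how many times the message was received


@dataclass
class Stores:
    # In-memory stand-ins for Elasticsearch, neo4j and the output queue.
    transcripts: dict = field(default_factory=dict)  # blob_id -> transcript
    states: dict = field(default_factory=dict)       # blob_id -> ExtractorState
    deleted: list = field(default_factory=list)      # blob_ids of deleted messages


def handle_output_message(msg: OutputMessage, stores: Stores) -> None:
    if msg.status == "SUCCESS":
        # Index the transcript, mark the blob as processed, delete the message.
        stores.transcripts[msg.blob_id] = msg.transcript
        stores.states[msg.blob_id] = ExtractorState.PROCESSED
        stores.deleted.append(msg.blob_id)
        return

    # Failure: only mark the blob as failed once all attempts are used up.
    if msg.receive_count >= MAX_RECEIVES:
        stores.states[msg.blob_id] = ExtractorState.FAILURE
    # The message is deliberately not deleted, so the queue can move it
    # to the dead letter queue.
```

Leaving the failed message undeleted keeps the failure visible in the dead letter queue instead of silently dropping it.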

The following SSM parameters were created for playground but should also be created for pfi-giant (prod):

  • /pfi/pfi-playground/rex/transcribe/transcriptionServiceQueueUrl
  • /pfi/pfi-playground/rex/transcribe/transcriptionOutputQueueUrl
  • /pfi/pfi-playground/rex/transcribe/transcriptionOutputDeadLetterQueueUrl
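
As a convenience when creating the prod parameters, the names can be generated per stack. The sketch below is in Python and assumes the prod paths mirror the playground ones with only the stack segment swapped for `pfi-giant`. That is an assumption and should be checked before creating anything. `transcribe_parameter_paths` is a hypothetical helper, not part of this PR.

```python
# Parameter names taken from the playground list above.
PARAMETER_NAMES = (
    "transcriptionServiceQueueUrl",
    "transcriptionOutputQueueUrl",
    "transcriptionOutputDeadLetterQueueUrl",
)


def transcribe_parameter_paths(stack: str) -> list[str]:
    # Assumes every stack uses the same /pfi/<stack>/rex/transcribe/ prefix.
    return [f"/pfi/{stack}/rex/transcribe/{name}" for name in PARAMETER_NAMES]


if __name__ == "__main__":
    for path in transcribe_parameter_paths("pfi-giant"):
        print(path)
```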

TODO in upcoming PR

  • zip and unzip the transcript files rather than handling the 3 file formats separately

How to test

Tested locally and in code

The relevant PRs for this change, in the order they need to be released, are as follows:

  1. https://github.com/guardian/investigations-platform/pull/521
  2. https://github.com/guardian/transcription-service/pull/103
  3. Current PR

marjisound, Oct 02 '24 07:10

This is looking great - just a few minor comments above

philmcmahon, Oct 03 '24 10:10