How to add Phrase List to SpeechToTextSDK to improve transcription?

Open dhhailinh opened this issue 2 years ago • 7 comments

SynapseML version

synapseml_2.12:0.10.0

System information

  • Language version : Python 3.8.10
  • Spark Version (e.g. 3.2.2): 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
  • Spark Platform : Databricks

Describe the problem

Hello all,

I'm using SpeechToTextSDK from SynapseML's cognitive module in Databricks to transcribe audio files into text. The code below works, but only without a Phrase List.

I found a reference for creating a regular phrase list here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/improve-accuracy-phrase-list?tabs=terminal&pivots=programming-language-python#implement-phrase-list

But how can I add a Phrase List to SpeechToTextSDK in SynapseML, please? Thank you,
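For context, here is a minimal sketch of what the linked documentation describes for the regular Speech SDK for Python (the `azure-cognitiveservices-speech` package), outside of SynapseML. The function names `normalize_phrases` and `build_recognizer_with_phrases`, and their arguments, are hypothetical names chosen for illustration; the key, region, and file path are placeholders.

```python
from typing import Iterable, List


def normalize_phrases(phrases: Iterable[str]) -> List[str]:
    """Hypothetical helper: strip whitespace, drop empties and duplicates, keep order."""
    seen = set()
    cleaned = []
    for phrase in phrases:
        text = phrase.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def build_recognizer_with_phrases(key: str, region: str, wav_path: str, phrases: Iterable[str]):
    """Create a SpeechRecognizer and register each phrase in its phrase list."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    # The phrase list is attached to the recognizer, not to the config.
    grammar = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
    for phrase in normalize_phrases(phrases):
        grammar.addPhrase(phrase)
    return recognizer
```

Calling `recognize_once()` on the returned recognizer would then transcribe a single utterance with the phrases applied. This is the per-file approach I would like to avoid by staying in SynapseML.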

Code to reproduce issue

    import synapse.ml
    from synapse.ml.cognitive import *

    stt = (SpeechToTextSDK()
        .setSubscriptionKey(YOUR_API_KEY)
        .setLocation(REGION)
        .setOutputCol("text")
        .setAudioDataCol("content")
        .setFormat("detailed")
        .setFileTypeCol("format")
        .setLanguageCol("lang")
        .setStreamIntermediateResults(False)
    )

    results = stt.transform(audio_w_lang_format)
    display(results)

Other info / logs

No response

What component(s) does this bug affect?

  • [X] area/cognitive: Cognitive project
  • [ ] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [ ] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [ ] area/website: Website
  • [ ] area/build: Project build system
  • [ ] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [ ] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [X] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [X] integrations/databricks: Databricks integrations

AB#1956013

dhhailinh avatar Aug 31 '22 08:08 dhhailinh

Hey @dhhailinh :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] avatar Aug 31 '22 08:08 github-actions[bot]

Hi @dhhailinh - It looks like we don't currently support the PhraseList functionality. Thank you for bringing this to our attention. It seems like something we should add. I've added this request to our list of potential work items. Whether it gets picked up will depend on where it lands in relation to other already scheduled items. Note that we do accept PRs from the public, should you be interested in contributing. Thanks again.

niehaus59 avatar Aug 31 '22 20:08 niehaus59

Hello @niehaus59 , Thanks for your answer.

PhraseList is indeed a very important functionality. Without it, I will need to go back to the regular way of working with speechsdk in Python and give up SynapseML and the power of Spark in Databricks. I guess that many Spark or Databricks users will be in the same situation with SpeechToTextSDK, or will have to write a custom Transformer.

How can I contribute to accelerate the process? Should I create a new pull request? Thanks for your advice,

dhhailinh avatar Sep 01 '22 07:09 dhhailinh

@dhhailinh - Yes a PR would be the way to go. See https://github.com/microsoft/SynapseML/blob/master/website/docs/reference/contributing_guide.md and https://github.com/microsoft/SynapseML/blob/master/website/docs/reference/developer-readme.md

SpeechToTextSDK is at https://github.com/microsoft/SynapseML/blob/master/cognitive/src/main/scala/com/microsoft/azure/synapse/ml/cognitive/SpeechToTextSDK.scala

niehaus59 avatar Sep 01 '22 20:09 niehaus59

Hey @dhhailinh happy to hop on a call to help you get started. Thanks for your interest, it should be a fairly local fix!

mhamilton723 avatar Sep 01 '22 23:09 mhamilton723

Here's the main arch of this work:

  1. Add an extra ServiceParam on SpeechSDKBase with type Array[String]

  2. Add the calls to add the phrases somewhere around these locations

https://github.com/microsoft/SynapseML/blob/4115d4f0f2ea5210b9eafd777ff7dc6f4567a7fb/cognitive/src/main/scala/com/microsoft/azure/synapse/ml/cognitive/SpeechToTextSDK.scala#L445

and

https://github.com/microsoft/SynapseML/blob/4115d4f0f2ea5210b9eafd777ff7dc6f4567a7fb/cognitive/src/main/scala/com/microsoft/azure/synapse/ml/cognitive/SpeechToTextSDK.scala#L535

  3. Make sure the data necessary flows through by adding args to those functions, compiling, and seeing where upstream arguments need to be plumbed in

  4. Write a test to demonstrate the functionality works as expected
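The real change would be made in Scala inside SpeechToTextSDK.scala. As a rough illustration of the data flow only, here is a Python sketch under stated assumptions: `SpeechParams`, `apply_phrase_list`, and `FakePhraseListGrammar` are hypothetical stand-ins, not SynapseML or Speech SDK classes. It mirrors the steps above: an optional list-of-strings parameter, a call site that adds each phrase when the recognizer is set up, and a small test.

```python
from dataclasses import dataclass
from typing import List, Optional


class FakePhraseListGrammar:
    """Stand-in for the SDK's phrase list grammar; it only records added phrases."""

    def __init__(self) -> None:
        self.phrases: List[str] = []

    def addPhrase(self, phrase: str) -> None:
        self.phrases.append(phrase)


@dataclass
class SpeechParams:
    # Extra optional parameter holding the phrases (the Array[String] param in the plan).
    phrase_list: Optional[List[str]] = None


def apply_phrase_list(grammar, params: SpeechParams):
    """Add every configured phrase to the grammar; do nothing when none are set."""
    for phrase in params.phrase_list or []:
        grammar.addPhrase(phrase)
    return grammar


# A test in the spirit of the last step: phrases set on the params reach the grammar.
grammar = apply_phrase_list(
    FakePhraseListGrammar(), SpeechParams(phrase_list=["SynapseML", "Databricks"])
)
assert grammar.phrases == ["SynapseML", "Databricks"]
```

Keeping the parameter optional means existing pipelines that never set a phrase list behave exactly as before.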

mhamilton723 avatar Sep 02 '22 03:09 mhamilton723

Thanks @mhamilton723 and @niehaus59 ,

I will have a look at the source code and get back to you.

dhhailinh avatar Sep 02 '22 08:09 dhhailinh