
Reconstruct audio time with tokens returned from dictation endpoint

Open rhenanbartels opened this issue 2 years ago • 2 comments

Do you want to request a feature, report a bug, or ask a question about wit? Question

Hi everyone, first of all, I would like to thank you all for the amazing work on the Wit service.

I have a question regarding the tokens returned from the new /dictation endpoint:

Short version

Is it possible to precisely reconstruct the audio duration from the timecodes of the tokens?

Longer version:

Prior to the /dictation endpoint, we used the /speech endpoint and sent chunks of approximately 20 s each, cut from a longer audio file (split on silence). To keep track of the audio time as the transcription proceeded, we used the following equation:

Bytes Per Second (bps) = Sample Rate (Hz) * Word Length (bits) * Channel Count * 0.125

Dividing a chunk's size in bytes by this rate tells us the time interval covered by the transcribed chunk. Now, using the /dictation endpoint, we are trying to reconstruct the same interval from the tokens' timecodes, but the values do not match. Is there something we need to take into account when reconstructing the interval from the tokens' timecodes?
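The equation above becomes a duration once the chunk's byte count is divided by the byte rate. A minimal Python sketch, assuming raw PCM data with no header bytes; the function names and the 16 kHz / 16-bit / mono values are illustrative and not taken from this thread:

```python
def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Byte rate of uncompressed PCM audio (0.125 converts bits to bytes)."""
    return sample_rate_hz * bits_per_sample * channels * 0.125


def chunk_duration_seconds(num_bytes: int, sample_rate_hz: int,
                           bits_per_sample: int, channels: int) -> float:
    """Duration of a raw PCM chunk, given its size in bytes."""
    return num_bytes / bytes_per_second(sample_rate_hz, bits_per_sample, channels)


# Hypothetical example values: 16 kHz, 16-bit, mono audio.
rate = bytes_per_second(16000, 16, 1)                    # 32000.0 bytes per second
duration = chunk_duration_seconds(528000, 16000, 16, 1)  # 16.5 seconds
print(rate, duration)
```

If the chunk is a WAV file rather than raw PCM, the header bytes would have to be excluded from the byte count first.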

Below is an example response, which also includes the time interval obtained with the equation. Note that the total time does not match: the last token ends at 16320 (16.32 s), while the chunk sent is 16.5 s long. The difference may seem small, but accumulated over all chunks it is enough to put the text out of sync with the audio.

{'end': 16.5,  # the length of the chunk sent to /dictation endpoint calculated with the equation
 'text': 'Tá, vamos Ponto. Quanto não tem dimensão? Isso não é uma definição, mas é uma característica dele',
 'start': 0.0,
 'tokens': [{'tokens': [{'end': 0, 'start': 0, 'token': ''},
    {'end': 5520, 'start': 4520, 'token': 'Tá,'},
    {'end': 6240, 'start': 5520, 'token': 'vamos'},
    {'end': 6240, 'start': 6240, 'token': ''}],
   'confidence': 0.8972},
  {'tokens': [{'end': 7800, 'start': 7800, 'token': ''},
    {'end': 10560, 'start': 9560, 'token': 'Ponto.'},
    {'end': 10920, 'start': 10560, 'token': ''}],
   'confidence': 0.7612},
  {'tokens': [{'end': 11700, 'start': 11700, 'token': ''},
    {'end': 13320, 'start': 12320, 'token': 'Quanto'},
    {'end': 13500, 'start': 13320, 'token': 'não'},
    {'end': 13620, 'start': 13500, 'token': 'tem'},
    {'end': 14100, 'start': 13620, 'token': 'dimensão?'},
    {'end': 14400, 'start': 14100, 'token': 'Isso'},
    {'end': 14580, 'start': 14400, 'token': 'não'},
    {'end': 14640, 'start': 14580, 'token': 'é'},
    {'end': 14760, 'start': 14640, 'token': 'uma'},
    {'end': 15120, 'start': 14760, 'token': 'definição,'},
    {'end': 15300, 'start': 15120, 'token': 'mas'},
    {'end': 15480, 'start': 15300, 'token': 'é'},
    {'end': 15540, 'start': 15480, 'token': 'uma'},
    {'end': 16020, 'start': 15540, 'token': 'característica'},
    {'end': 16320, 'start': 16020, 'token': 'dele'},
    {'end': 16320, 'start': 16320, 'token': ''}],
   'confidence': 0.8018}]}
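One way to inspect the mismatch is to compare the last token's `end` with the chunk duration, and to advance the running offset by the chunk duration rather than by the token timecodes. The sketch below assumes the timecodes are milliseconds relative to the start of the chunk (consistent with 16320 being read as 16.32 s) and that the audio after the last token is untranscribed silence; neither assumption is confirmed in this thread. The function names are hypothetical, and the segments are abbreviated from the response above.

```python
def last_token_end_ms(segments):
    """Largest 'end' timecode (ms) across all tokens in all segments."""
    return max(tok["end"] for seg in segments for tok in seg["tokens"])


def shift_tokens(segments, chunk_start_s):
    """Map non-empty tokens to absolute (token, start_s, end_s) tuples.

    Assumes token timecodes are milliseconds relative to the chunk start.
    """
    shifted = []
    for seg in segments:
        for tok in seg["tokens"]:
            if tok["token"]:  # skip the empty boundary tokens
                shifted.append((tok["token"],
                                chunk_start_s + tok["start"] / 1000,
                                chunk_start_s + tok["end"] / 1000))
    return shifted


# Abbreviated from the response above (first and last tokens only).
segments = [
    {"tokens": [{"end": 0, "start": 0, "token": ""},
                {"end": 5520, "start": 4520, "token": "Tá,"}]},
    {"tokens": [{"end": 16320, "start": 16020, "token": "dele"},
                {"end": 16320, "start": 16320, "token": ""}]},
]

chunk_duration_s = 16.5  # from the bytes-per-second equation
trailing_gap_s = chunk_duration_s - last_token_end_ms(segments) / 1000
print(f"audio not covered by tokens: {trailing_gap_s:.2f} s")

# Offset of the next chunk: advance by the full chunk duration,
# not by the last token's end, so the gap does not accumulate.
chunk_start_s = 0.0
next_chunk_start_s = chunk_start_s + chunk_duration_s
print(shift_tokens(segments, chunk_start_s))
```

Under these assumptions, the 0.18 s difference is audio at the end of the chunk that no token covers, so summing token end times across chunks would drift while summing chunk durations would not.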

rhenanbartels, Sep 27 '22 19:09

Can you please tell me how you got that text in Portuguese? I am only getting output in English and don't know how to change the output language.

andysagar, Sep 28 '22 09:09

@andysagar it's a setting chosen on the wit.ai platform when creating a new app. There is a dropdown menu with the available languages.

rhenanbartels, Sep 28 '22 11:09

Closing due to no activity on the issue. Please re-open it or file a new one if the problem persists.

Barbog, Apr 18 '23 09:04