
Enabling word-level timestamps for all W2L Decoders


Before submitting

  • [ ] Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • [x] Did you read the contributor guideline?
  • [ ] Did you make sure to update the docs?
  • [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #3371 and extends #3627 by adding the ability to return the frame numbers of all non-blank characters of a hypothesis for all wav2letter decoder classes, not just W2lKenLMDecoder. A get_symbols() method was also added to the decoders' parent class (W2lDecoder) so that the non-blank characters of a hypothesis can be returned as a list of natural-language characters rather than token ids. This makes it easier to locate the word-boundary tokens later when computing word-level timestamps with the following formula (a sketch of the computation follows the list of terms below):

timestamp = frame_num * (audio_len / (num_frames * sample_rate))

where:

  • frame_num = the timestep of the symbol, as returned in the 'timesteps' field of the W2lDecoder.decode() output.
  • audio_len = the number of samples in the loaded audio file corresponding to the transcript (with batched w2v2 acoustic model inference, each file is zero-padded to the length of the longest audio file in the batch).
  • num_frames = the number of frames in the emission matrix returned by w2v2 acoustic model inference for that audio file (with batched inference this is the same for every file, since all files are padded to the length of the longest audio file in the batch).
  • sample_rate = the sample rate of the loaded audio files (usually 16000 Hz).
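
As a minimal sketch (not the PR's actual code) of how these pieces fit together, the snippet below groups per-character frame indices into word-level timestamps using the formula above. It assumes the decoder hypothesis provides per-symbol frame indices in its 'timesteps' field and the matching non-blank characters via get_symbols(); the word-separator symbol `"|"` is an assumption based on typical wav2vec 2.0 character dictionaries and may differ for your target dictionary.

```python
def word_timestamps(symbols, timesteps, audio_len, num_frames,
                    sample_rate=16000, word_sep="|"):
    """Group per-character frame indices into (word, start_sec, end_sec) tuples."""
    # Seconds per emission frame, per the formula above:
    # timestamp = frame_num * (audio_len / (num_frames * sample_rate))
    sec_per_frame = audio_len / (num_frames * sample_rate)
    words, chars, frames = [], [], []
    for sym, frame in zip(symbols, timesteps):
        if sym == word_sep:
            # Word boundary: flush the accumulated characters as one word.
            if chars:
                words.append(("".join(chars),
                              frames[0] * sec_per_frame,
                              frames[-1] * sec_per_frame))
            chars, frames = [], []
        else:
            chars.append(sym)
            frames.append(frame)
    if chars:  # flush the final word (no trailing separator)
        words.append(("".join(chars),
                      frames[0] * sec_per_frame,
                      frames[-1] * sec_per_frame))
    return words


# Hypothetical usage, with symbols as returned by get_symbols() and
# frame indices from the hypothesis's 'timesteps' field:
syms = ["h", "i", "|", "t", "h", "e", "r", "e"]
steps = [3, 5, 9, 12, 14, 16, 18, 20]
print(word_timestamps(syms, steps, audio_len=16000, num_frames=49))
# -> approximately [('hi', 0.061, 0.102), ('there', 0.245, 0.408)]
```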

PR review

@alexeib

abarcovschi · Dec 17 '23, 22:12