transformers.js
Add Whisper Voice Activity Detector (VAD) or Silero VAD for silence suppression
Feature request
A new feature that uses VAD for silence suppression. A more detailed description can be found at https://github.com/jianfch/stable-ts?tab=readme-ov-file#silence-suppression
Motivation
The current Whisper implementation fails to account for silences in word-level timestamps due to the absence of VAD:
```js
} else {
    current_tokens.push(token);
    if (returnWordTimestamps) {
        let start_time = round(token_timestamps[i] + time_offset, 2);
        let end_time;
        if (i + 1 < token_timestamps.length) {
            end_time = round(token_timestamps[i + 1] + time_offset, 2);
        } else {
            end_time = null;
        }
        current_token_timestamps.push([start_time, end_time]);
    }
}
```
The decoding algorithm simply sets end_time to the next predicted token's timestamp, but this handles neither silences nor breaks in the audio file. In some cases the predicted timestamp anticipates the word's actual occurrence in the audio, producing wrong word timestamps.
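To make the failure mode concrete, here is a minimal standalone sketch (not the library's code; the timestamp values are made up for illustration) of the naive "end = next token's start" pairing. With a long silence between the second and third tokens, the second word is stretched across the entire gap:

```javascript
// Hypothetical token-level timestamps (seconds), with a ~2.5 s silence
// between the second and third tokens.
const round = (x, p) => Math.round(x * 10 ** p) / 10 ** p;
const token_timestamps = [0.0, 0.5, 3.0];
const time_offset = 0;

// Naive pairing: each word ends where the next token starts.
const word_timestamps = token_timestamps.map((t, i) => {
    const start_time = round(t + time_offset, 2);
    const end_time = i + 1 < token_timestamps.length
        ? round(token_timestamps[i + 1] + time_offset, 2)
        : null;
    return [start_time, end_time];
});

console.log(word_timestamps);
// [[0, 0.5], [0.5, 3], [3, null]] — the second word is reported as
// lasting 2.5 s even though most of that span is silence.
```

A VAD pass would instead detect the silent region and clip the word's end time to the point where speech actually stops.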
Your contribution
https://github.com/MatteoFasulo/transformers.js/blob/fix-word-timestamps/src/tokenizers.js :
```js
if (returnWordTimestamps) {
    let start_time = round(token_timestamps[i] + time_offset, 2);
    let end_time;
    let regex = /[!.,;?]+$/;
    let decoded_text = this.decode([token]);
    if (i + 1 < token_timestamps.length) {
        end_time = round(token_timestamps[i + 1] + time_offset, 2);
        // If the token is a punctuation mark, we can assume it's the end of a word in most cases
        if (regex.test(decoded_text)) {
            end_time = round(start_time + 0.02, 2); // +0.02 to avoid overlapping timestamps
        }
    } else {
        // should never happen
        end_time = null;
    }
    current_token_timestamps.push([start_time, end_time]);
}
```
The regex pattern detects trailing punctuation and assigns a short, fixed end timestamp for such tokens (+0.02 s, just to avoid overlapping timestamps). This temporary solution should address the issue when using return_timestamps: 'word'.
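The heuristic can be isolated in a small self-contained sketch (the function name and sample inputs here are hypothetical, not part of the patch) showing how a punctuation-final token gets a clamped end time while an ordinary token keeps the next token's start:

```javascript
// Same rounding helper and regex as in the proposed patch.
const round = (x, p) => Math.round(x * 10 ** p) / 10 ** p;
const regex = /[!.,;?]+$/;

// Hypothetical helper illustrating the heuristic in isolation.
function endTimeFor(decoded_text, start_time, next_start) {
    if (regex.test(decoded_text)) {
        // Punctuation usually ends a word: clamp to a 20 ms duration so the
        // timestamp does not bleed into a following silence.
        return round(start_time + 0.02, 2);
    }
    // Otherwise fall back to the naive next-token-start rule.
    return next_start;
}

console.log(endTimeFor('word.', 1.5, 4.0)); // 1.52 (clamped)
console.log(endTimeFor('word', 1.5, 4.0));  // 4 (naive rule)
```

Note that a fixed 20 ms duration is only a stopgap; a real VAD would measure where the speech energy actually ends.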
I am not able to work on VAD or Silero VAD right now, but a PR with these changes would be great. Waiting for @xenova's thoughts on this.