transformers.js
Add Whisper Voice Activity Detector (VAD) or Silero VAD for silence suppression
Feature request
A new feature that uses VAD for silence suppression. A more detailed description can be found at https://github.com/jianfch/stable-ts?tab=readme-ov-file#silence-suppression
Motivation
The current Whisper implementation fails to account for silences in word-level timestamps due to the absence of VAD:
```js
} else {
    current_tokens.push(token);
    if (returnWordTimestamps) {
        let start_time = round(token_timestamps[i] + time_offset, 2);
        let end_time;
        if (i + 1 < token_timestamps.length) {
            end_time = round(token_timestamps[i + 1] + time_offset, 2);
        } else {
            end_time = null;
        }
        current_token_timestamps.push([start_time, end_time]);
    }
}
```
The decoding algorithm simply sets end_time to the next predicted token's timestamp, but this handles neither silences nor breaks in the audio file. In some cases the predicted timestamp anticipates the word's actual occurrence in the audio, producing wrong word timestamps.
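To make the failure mode concrete, here is a minimal standalone sketch (not the library's code; the timestamp values are made up for illustration) of the naive "end = next token's start" pairing. With a long silence between the second and third tokens, the second word is stretched across the entire gap:

```javascript
// Hypothetical token-level timestamps (seconds), with a ~2.5 s silence
// between the second and third tokens.
const round = (x, p) => Math.round(x * 10 ** p) / 10 ** p;
const token_timestamps = [0.0, 0.5, 3.0];
const time_offset = 0;

// Naive pairing: each word ends where the next token starts.
const word_timestamps = token_timestamps.map((t, i) => {
    const start_time = round(t + time_offset, 2);
    const end_time = i + 1 < token_timestamps.length
        ? round(token_timestamps[i + 1] + time_offset, 2)
        : null;
    return [start_time, end_time];
});

console.log(word_timestamps);
// [[0, 0.5], [0.5, 3], [3, null]] — the second word is reported as
// lasting 2.5 s even though most of that span is silence.
```

A VAD pass would instead detect the silent region and clip the word's end time to the point where speech actually stops.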
Your contribution
https://github.com/MatteoFasulo/transformers.js/blob/fix-word-timestamps/src/tokenizers.js :
```js
if (returnWordTimestamps) {
    let start_time = round(token_timestamps[i] + time_offset, 2);
    let end_time;
    let regex = /[!.,;?]+$/;
    let decoded_text = this.decode([token]);
    if (i + 1 < token_timestamps.length) {
        end_time = round(token_timestamps[i + 1] + time_offset, 2);
        // If the token is a punctuation mark, we can assume it's the end of a word in most cases
        if (regex.test(decoded_text)) {
            end_time = round(start_time + 0.02, 2); // +0.02 to avoid overlapping timestamps
        }
    } else {
        // should never happen
        end_time = null;
    }
    current_token_timestamps.push([start_time, end_time]);
}
```
The regex pattern detects trailing punctuation and assigns a short, fixed end timestamp for such tokens (+0.02 s, just to avoid overlapping timestamps). This temporary solution should address the issue when using return_timestamps: 'word'.
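The heuristic can be isolated in a small self-contained sketch (the function name and sample inputs here are hypothetical, not part of the patch) showing how a punctuation-final token gets a clamped end time while an ordinary token keeps the next token's start:

```javascript
// Same rounding helper and regex as in the proposed patch.
const round = (x, p) => Math.round(x * 10 ** p) / 10 ** p;
const regex = /[!.,;?]+$/;

// Hypothetical helper illustrating the heuristic in isolation.
function endTimeFor(decoded_text, start_time, next_start) {
    if (regex.test(decoded_text)) {
        // Punctuation usually ends a word: clamp to a 20 ms duration so the
        // timestamp does not bleed into a following silence.
        return round(start_time + 0.02, 2);
    }
    // Otherwise fall back to the naive next-token-start rule.
    return next_start;
}

console.log(endTimeFor('word.', 1.5, 4.0)); // 1.52 (clamped)
console.log(endTimeFor('word', 1.5, 4.0));  // 4 (naive rule)
```

Note that a fixed 20 ms duration is only a stopgap; a real VAD would measure where the speech energy actually ends.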
I am not able to work on VAD or Silero VAD right now, but a PR with these changes would be great. Waiting for @xenova's thoughts on this.