candle tracking: support silero-vad v5

This ticket tracks my efforts to get silero-vad v5 running with candle. I have a proof-of-concept branch working, but I'm trying to merge the changes in meaningful bits.

features:

[x] onnx eval "If" op #2251
[x] onnx eval "Pad" op with mode == "reflect" #2251 #2317
[x] onnx eval "Slice" op #2260
[x] onnx eval "LSTM" op #2268
[x] onnx eval "Size" op #2316
[ ] vad example in candle-examples #2321

bugs:

[x] unsqueeze with negative index is off-by-one #2317
[ ] index_select with negative index throws an error #2320
[ ] pow on cpu backend returns NaN incorrectly #2318

Jul 07 '24 19:07 shua

@LaurentMazare you asked in https://github.com/huggingface/candle/pull/2251#issuecomment-2153366138 whether I had a vad example. I've gotten it working locally after fixing the above bugs or implementing the above features. #2321 would be the final culmination of these tickets, which I can open for review once the other tickets have been merged.

Jul 10 '24 16:07 shua

I don't know if you're interested, but as part of this work I have a local test harness which will run an onnx model in candle-onnx and ort and then if they differ in output, it will find specifically which node they differ. That's how I was able to find the bugs listed above, and it was a lot quicker than trying to make pytorch examples.

I can try to PR that if you'd like but not sure if it should be in its own package or some feature-gate on candle-onnx to avoid depending on ort in general.

Jul 10 '24 16:07 shua

ping @LaurentMazare is there anything I can do to make these PRs easier to review?

Aug 01 '24 20:08 shua

Sorry for the delay, the LSTM part looked good so I just merged it. Re the two bug fixes, I'm a bit less enclined to do so as they are currently conscious restrictions of the pow and select operators. Are these actually needed for this model?

Aug 19 '24 07:08 LaurentMazare

The specific use in silero models is index by -1 and raising tensor (which contains negative numbers) to a scalar power of 2. I could special case fixes in the cpu backend or onnx simple eval to catch these cases and do something equivalent (eg repeated mul instead of pow), but I suspect that would lead to more bugs filed like "why can I index_select with -1 but not -2?"

I also suspect (but cannot confirm) that this behaviour is already correct in the gpu backends but only broken in the cpu backend.

Aug 19 '24 08:08 shua

Also, if I knew why you are avoiding using the system's pow builtins, I could try to accommodate that?

Aug 19 '24 08:08 shua

The goal is mainly to reduce the number of operations that each backend must implement, hence the binary pow implemented with exp/log at the moment. Note that there is a unary pow operation powf that takes as input the same exponent for all the elements of the tensor and this one is the most likely to be used with negative inputs though I agree it doesn't cover all possible use cases.

Aug 19 '24 13:08 LaurentMazare

huh, I think that is this usecase, but it looks like onnx simpl_eval simply calls broadcast_pow when evaluating the "Pow" node type. I could instead add a check for scalar power and branch on that whether it calls powf vs broadcast_pow in the onnx code, or I could add a branch in broadcast_pow to fallback to powf if conditions match. I'm leaning towards keeping hacks in the edges at simple_eval rather than deeper.

Aug 19 '24 18:08 shua

Yeah that sounds right to me, optimizing nodes that are Pow with the second arg being a Constant using a single value is likely to cover most use cases and would keep the hacks in simple_eval or around.

Aug 19 '24 19:08 LaurentMazare

Okay, found a similarly focused fix for the index_select issue. Once #2440 is merged, I can open #2321 to add a vad example.

Aug 22 '24 13:08 shua

I can open #2321 to add a vad example.

Yes that seems like a great example to include, especially as we don't have much audio examples yet.

Aug 22 '24 13:08 LaurentMazare

@shua great addition, we use web rtc vad atm and will try soon https://github.com/mediar-ai/screenpipe/issues/241

wonder if the silero model itself is better than whisper? we have lot of issues with whisper (distil large v3) "thank you" and stuff like this when nobody speaks (but prob the silero vad will solve this) because we record all audio devices 24/7

Aug 28 '24 15:08 louis030195

I can't speak to the quality of the model. I came upon silero-vad because it's an optional part of https://github.com/rhasspy/wyoming-satellite to put in front of openwakeword or whisper. More generally, voice activity detection (vad) is solving different problems from speech to text like whisper, so I don't know if it makes sense to compare the two.

Sep 06 '24 11:09 shua

candle candle copied to clipboard

tracking: support silero-vad v5

candle
candle copied to clipboard