DeepFilterNet
I've made a torch reimplementation for both offline and streaming inference. Would you be interested in accepting this contribution?
Hello! I wrote a complete torch implementation without using rust or tract. I'm thinking about a PR, so I decided to ask you: would you be interested in accepting this contribution? This may be more helpful for understanding the model, because the rust/tract code is currently difficult to understand for the average torch user. Also, my streaming implementation can be fully exported into a single ONNX model, which is useful, for example, for easier web inference. I've written tests that show equivalence with the current code (or with a not-very-old commit).
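For context on the ONNX export: once the model is frame-in/frame-out with explicit state tensors, the export itself is a standard `torch.onnx.export` call. A minimal, self-contained sketch (the toy module below is illustrative only, not the actual PR model):

```python
import torch


class TinyStreamingModel(torch.nn.Module):
    """Toy stand-in for a frame-in/frame-out denoiser with explicit state."""

    def forward(self, frame: torch.Tensor, state: torch.Tensor):
        # Carry a simple recurrent state between calls; a real model would
        # carry conv buffers / GRU states here instead.
        new_state = 0.9 * state + 0.1 * frame.mean()
        return frame * torch.sigmoid(new_state), new_state


model = TinyStreamingModel().eval()
frame = torch.zeros(480)   # one hop of audio samples
state = torch.zeros(1)     # recurrent state, passed in and out explicitly

torch.onnx.export(
    model,
    (frame, state),
    "denoiser_streaming.onnx",
    input_names=["frame", "state"],
    output_names=["enhanced_frame", "new_state"],
    opset_version=17,
)
```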
I'm very interested in your implementation. I tried to implement it in a streaming way in pure C with an ONNX model, but the delay was too long. I'd like to know how you implemented the streaming algorithm.
I'm very interested in this as well; you should submit a PR and see what they say. Have you tried running the standalone ONNX with something like onnxjs, and how big a footprint does the model leave?
Yeah, I've tried it using onnxruntime-web in a web worker. It works fine. I didn't look at the footprint, only at CPU usage. I can measure it later.
Can you toss a link to the branch/commit you're looking at PRing? I'd love to take it for a spin.
My code is currently in my private GitLab repository. I think I'll be able to share it soon.
Could you share the link to your repo? I was really interested in removing the dependency on the complex notation. The pure torch implementation will help immensely.
So, I've created a draft PR to show my changes. I'll be glad to hear your feedback. This PR requires more changes to be compatible with the current code, so it's a draft for now.
I'm not sure if I've misunderstood: is the main difference between 'Offline' and 'Streaming', as you mentioned, the ability to convert the entire model into ONNX format, rather than a difference in their applications, such as full-audio processing versus real-time use (like internet calls)?
@Penguin168 The `Offline` model processes the audio completely - we put the full audio spectrogram into the model. The `Streaming` model processes the audio frame by frame. Only the streaming model can be exported to ONNX, and it can be used in real-time applications where the offline model cannot. If we process one audio with both models separately, the result will be the same.
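To make the difference concrete, here is a minimal sketch of the streaming loop (the model call signature is assumed for illustration; the offline model would instead be called once on the full audio):

```python
import torch

HOP = 480  # samples per frame at 48 kHz, i.e. 10 ms


def enhance_streaming(model, audio: torch.Tensor, state) -> torch.Tensor:
    """Run a frame-in/frame-out model over audio, carrying state forward."""
    out = []
    for start in range(0, audio.shape[-1] - HOP + 1, HOP):
        frame = audio[..., start:start + HOP]
        # The state must persist and be updated on every frame;
        # the offline model sees the whole utterance at once instead.
        enhanced, state = model(frame, state)
        out.append(enhanced)
    return torch.cat(out, dim=-1)
```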
Great job! Have you tested the streaming-mode performance on your phone's CPU? I tried it on Android today and found that the RTF is very large. Can the RTF be reduced by increasing hop_size?
I haven't tested it on phones, but I'm going to try soon. I think changing hop_size can lead to quality reduction. I've thought about trying quantization to reduce the RTF and checkpoint size, but I haven't tried it yet.
Hi,
Great work, thanks @grazder! I tried using an ONNX model with WebAssembly in a web browser. After 15 seconds, the sound became robotic. What do you think the problem might be here?
Hi! @zeynepgulhanuslu
I got a robotic voice when the model did not run fast enough and the AudioWorklet received audio samples too late, so it can be due to the web implementation. Currently I have a working demo with a worklet + worker + ring buffer setup, and it works fine on my Mac. You can check the model speed in your implementation. If that's not it, then it's because of model quality, but I haven't run into that yet.
Also, can you share details about your web setup?
I actually attempted to use it within Jitsi-Meet as a replacement for the RNNoise module, and it takes 7 ms for inference on 480 samples. Perhaps I could use VAD (Voice Activity Detection) so that I don't have to run the model all the time. If you're interested, I can send the CPP code via email. Thank you for the detailed answer
Yeah, that would be great. Do you build the wasm module using CPP? 7 ms seems enough, because you have a 10 ms window to run the model.
@zeynepgulhanuslu I understood the problem. Jitsi, if I'm not mistaken, runs the wasm module inside an AudioWorklet. So there you only have a `128 / sampleRate * 1000` window (here, 2.6 ms), and 7 ms is too long for it. Every time you run inference inside the AudioWorklet `.process` function, a lag is created, and because of the delay the sound can become robotic.
As a fix you can try a different approach:
https://developer.chrome.com/blog/audio-worklet-design-pattern/
https://github.com/WebAudio/web-audio-api/discussions/2550
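For reference, both budgets mentioned above follow directly from the sample rate: the AudioWorklet render quantum is fixed at 128 samples, while the model consumes 480-sample hops:

```python
SAMPLE_RATE = 48_000

# AudioWorklet calls process() once per 128-sample render quantum.
worklet_budget_ms = 128 / SAMPLE_RATE * 1000   # ~2.67 ms
# The model itself consumes audio in 480-sample hops.
hop_budget_ms = 480 / SAMPLE_RATE * 1000       # 10 ms

print(f"worklet budget: {worklet_budget_ms:.2f} ms, hop budget: {hop_budget_ms:.1f} ms")
# A 7 ms inference fits the 10 ms hop budget but not the ~2.67 ms worklet
# budget, hence moving inference off the audio thread (worker + ring buffer).
```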
@grazder thank you for your suggestions; I will examine different solutions as well. I thought Jitsi Meet takes 128-sample chunks and sends them to the process function once 480 samples have accumulated. Later, when I tested it in Python code and with wasm, I noticed that it processes some frames in 15 ms. Maybe it could be related to this. Do you have any idea what can be done to speed up the model?
```typescript
/**
 * Process an audio frame, optionally denoising the input pcmFrame and returning
 * the Voice Activity Detection score for a raw Float32 PCM sample Array.
 * The size of the array must be of exactly 480 samples, this constraint comes
 * from the rnnoise library.
 *
 * @param {Float32Array} pcmFrame - Array containing 32 bit PCM samples.
 * Parameter is also used as output when {@code shouldDenoise} is true.
 * @param {boolean} shouldDenoise - Should the denoised frame be returned in pcmFrame.
 * @returns {Float} Contains VAD score in the interval 0 - 1 i.e. 0.90 .
 */
processAudioFrame(pcmFrame: Float32Array, shouldDenoise: Boolean = false): number {
    // Convert 32 bit Float PCM samples to 16 bit Float PCM samples as that's
    // what rnnoise accepts as input
    for (let i = 0; i < RNNOISE_SAMPLE_LENGTH; i++) {
        this._wasmInterface.HEAPF32[this._wasmPcmInputF32Index + i] = pcmFrame[i] * SHIFT_16_BIT_NR;
    }

    // Use the same buffer for input/output, rnnoise supports this behavior
    const vadScore = this._wasmInterface._rnnoise_process_frame(
        this._context,
        this._wasmPcmInput,
        this._wasmPcmInput
    );

    // Rnnoise denoises the frame by default but we can avoid unnecessary operations
    // if the calling client doesn't use the denoised frame.
    if (shouldDenoise) {
        // Convert back to 32 bit PCM
        for (let i = 0; i < RNNOISE_SAMPLE_LENGTH; i++) {
            pcmFrame[i] = this._wasmInterface.HEAPF32[this._wasmPcmInputF32Index + i] / SHIFT_16_BIT_NR;
        }
    }

    return vadScore;
}
```
I think you can remove the normalization steps for the model (in your cpp code). I don't do them and it works fine. Also, you can profile your solution in the browser; I saw some lags because of the garbage collector.
Also, I use wasm-simd for ONNX inference and it works faster than plain wasm onnx (I don't build the wasm module right now). About model speed-up: I've already mentioned post-training static quantization and different graph optimizations. Right now I don't have other ideas. Maybe I'll try some things if it becomes necessary for me.
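For example, post-training quantization and graph optimizations with onnxruntime look roughly like this (the model paths are placeholders; static quantization would additionally require a calibration dataset):

```python
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Post-training dynamic quantization of the exported model
# ("model.onnx" is a placeholder path).
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

# Enable all graph optimizations when creating the inference session.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model_int8.onnx", sess_options=opts)
```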
Thanks for the excellent job! I tested the PR branch (Git commit: 5224ed4, branch: torchDF-changes) in both offline and streaming modes, and found that the quality of the voice processed by the offline version was consistent with the official testing procedure, while the audio quality loss in the streaming mode was more severe (starting from about 6 s; 0-6 s was OK). I don't know what caused it. Could you help me? Thank you!
offline infer log:
```
python3 torch_df_offline.py --input-folder ./data/in/ --output-folder ./data/out/
/workspace/aec/DeepFilterNet/DeepFilterNet/df/io.py:9: UserWarning: torchaudio.backend.common.AudioMetaData has been moved to torchaudio.AudioMetaData. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
2023-10-18 18:02:43 | INFO | DF | Running on torch 2.1.0+cu121
2023-10-18 18:02:43 | INFO | DF | Running on host hugo-home
2023-10-18 18:02:43 | INFO | DF | Git commit: 5224ed4, branch: torchDF-changes
2023-10-18 18:02:43 | INFO | DF | Loading model settings of DeepFilterNet3
2023-10-18 18:02:43 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
2023-10-18 18:02:43 | INFO | DF | Initializing model deepfilternet3
2023-10-18 18:02:43 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2023-10-18 18:02:43 | INFO | DF | Running on device cuda:0
2023-10-18 18:02:43 | INFO | DF | Model loaded
Reading audio from folder - ./data/in/
Found 1 audio in ./data/in/
Inferencing model to folder - ./data/out/...
```
streaming infer logs:
```
python3 torch_df_streaming.py --audio-path ../linein_aec2.wav --output-path data/out/denoised_test_streaming22.wav
/workspace/aec/DeepFilterNet/DeepFilterNet/df/io.py:9: UserWarning: torchaudio.backend.common.AudioMetaData has been moved to torchaudio.AudioMetaData. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
2023-10-18 17:49:47 | INFO | DF | Running on torch 2.1.0+cu121
2023-10-18 17:49:47 | INFO | DF | Running on host hugo-home
2023-10-18 17:49:47 | INFO | DF | Git commit: 5224ed4, branch: torchDF-changes
2023-10-18 17:49:47 | INFO | DF | Loading model settings of DeepFilterNet3
2023-10-18 17:49:47 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
2023-10-18 17:49:47 | INFO | DF | Initializing model deepfilternet3
2023-10-18 17:49:47 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2023-10-18 17:49:47 | INFO | DF | Running on device cuda:0
2023-10-18 17:49:47 | INFO | DF | Model loaded
```
@grazder Thank you for your reply. As far as I know, torch quantization is not very friendly. How is your model quantization progressing?
@GreatDarrenSun I'm not currently working on this because the current latency suits my needs. But maybe I'll pick it up someday.
It sounds slightly bad when some states are not updated and the input is silence. Continuing to update the states fixed it.
Hello, can you share the streaming implementation with me? I am very interested, but I don't know how to do it. Thank you very much. My email is: [email protected]
@308627993
You can find it here - https://github.com/Rikorose/DeepFilterNet/blob/5224ed47e19b6b327c9f22df6c5a061e3c3f6d5f/torchDF/torch_df_streaming.py
Here is PR - https://github.com/Rikorose/DeepFilterNet/pull/433/
hi, thanks a lot for your help!
@grazder Hey! Thank you for your work on this. It's really cool to see the torch reimplementation running!
I wanted to get your input on what I think is a bug with the `streaming` model.
If you listen to this collection of samples, there is some light chop in the `streaming` version that does not appear in the `offline` or normal DF3 output. I think you can hear it best near the end.
Do you have any thoughts on what might be causing it?
Well, you can try inference with the `apply_all_stages=True` parameter (in init), so you will get output like the offline model. I bet the problem is there; it's made to reach a full match with the rust implementation.
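For example (the pipeline class and call signature below are assumed for illustration; only the `apply_all_stages` flag itself is the actual parameter):

```python
import torch
from torch_df_streaming import TorchDFPipeline  # class name assumed

# Pass the flag at init, as described above.
pipeline = TorchDFPipeline(apply_all_stages=True)

noisy = torch.randn(1, 48_000)       # 1 s of dummy audio at 48 kHz
enhanced = pipeline(noisy, 48_000)   # call signature assumed
```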
I appreciate the reply! I just gave that a try, but the output still retained the choppiness.
Does anything else come to mind that might be at play?
Can you send the audio, please? I'll try to debug it tomorrow.
Of course. Thanks for looking into it!
I've attached the original sample and the two denoised versions from my video.