
I've made a torch reimplementation for both offline and streaming implementation. Would you be interested in accepting this contribution?

Open grazder opened this issue 1 year ago • 47 comments

Hello! I wrote a complete torch implementation without using rust or tract, and I'm thinking about a PR. So I decided to ask: would you be interested in accepting this contribution? It may help people understand the model, because the rust/tract code is currently difficult for the average torch user to follow. Also, my streaming implementation can be fully exported into a single ONNX model, which is useful, for example, for web inference. I've written tests that show equivalence with the current code (or with a not-very-old commit).

grazder avatar Sep 14 '23 08:09 grazder

Hello! I wrote a complete torch implementation without using rust or tract, and I'm thinking about a PR. So I decided to ask: are you interested in this? It may help people understand the model, because the rust/tract code is currently difficult for the average torch user to follow. Also, my streaming implementation can be fully exported into a single ONNX model, which is useful, for example, for web inference.

I'm very interested in your implementation. I tried to implement streaming in pure C with an ONNX model, but the delay was too long, so I'd like to know how you implemented the streaming algorithm.

Penguin168 avatar Sep 14 '23 09:09 Penguin168

I'm very interested in this as well; you should submit a PR and see what they say. Have you tried running the standalone ONNX with something like onnxjs, and how big a footprint does the model leave?

skyler14 avatar Sep 16 '23 05:09 skyler14

Yeah, I've tried it using onnxruntime-web in a web worker, and it works fine. I didn't look at the footprint, only at CPU usage. I can measure it later.

grazder avatar Sep 16 '23 07:09 grazder

Can you toss a link to the branch/commit you're looking at PRing? I'd love to take it for a spin.

skyler14 avatar Sep 17 '23 01:09 skyler14

My code is currently in my private gitlab repository. I think I'll be able to share this soon.

grazder avatar Sep 18 '23 07:09 grazder

Could you share the link to your repo? I was really interested in removing the dependency on the complex denotation. The pure torch implementation will help immensely.

IannoIITR avatar Sep 19 '23 12:09 IannoIITR

So, I've created a draft PR to show my changes. I'll be glad to hear your feedback. This PR still needs more changes to be compatible with the current code, which is why it's a draft.

grazder avatar Sep 20 '23 08:09 grazder

I'm not sure if I've misunderstood: the main difference between 'Offline' and 'Streaming' as you describe them is the ability to convert the entire model into ONNX format, rather than a difference in their applications, such as full-audio processing versus real-time use (like internet calls). Is that right?

Penguin168 avatar Sep 27 '23 07:09 Penguin168

@Penguin168 The Offline model processes the audio in one shot - we feed the full audio spectrogram into the model. The Streaming model processes audio frame by frame, carrying state between frames. Only the streaming model can be exported to ONNX, and only it can be used in real-time applications; the offline model cannot. If we process the same audio with both models separately, the results will be identical.
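To make the offline/streaming distinction concrete, here is a toy sketch in pure Python. The "model" is a stand-in single-pole smoother, not DeepFilterNet; all names are illustrative. The point is that a stateful frame-by-frame loop reproduces the one-shot pass exactly:

```python
def smooth_offline(frames, alpha=0.5):
    """Offline mode: process the whole signal at once."""
    out, state = [], 0.0
    for x in frames:
        state = alpha * state + (1 - alpha) * x
        out.append(state)
    return out

class SmoothStreaming:
    """Streaming mode: process one frame at a time, carrying state between calls."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.state = 0.0

    def process_frame(self, x):
        self.state = self.alpha * self.state + (1 - self.alpha) * x
        return self.state

frames = [1.0, 0.0, 2.0, -1.0]
offline = smooth_offline(frames)
streamer = SmoothStreaming()
streaming = [streamer.process_frame(x) for x in frames]
assert offline == streaming  # same output, frame by frame
```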

grazder avatar Sep 27 '23 08:09 grazder

Great job! Have you tested the streaming mode's performance on a phone CPU? I tried it on Android today and found that the RTF is very large. Can the RTF be reduced by increasing hop_size?

GreatDarrenSun avatar Sep 29 '23 12:09 GreatDarrenSun

I haven't tested it on phones, but I'm going to try soon. I think that changing hop_size could reduce quality. I've thought about trying quantization to reduce the RTF and checkpoint size, but I haven't tried it yet.

grazder avatar Sep 30 '23 16:09 grazder

Hi,

Great work, thanks @grazder! I tried using an ONNX model with WebAssembly in a web browser. After 15 seconds, the sound became robotic. What do you think the problem might be here?

zeynepgulhanuslu avatar Sep 30 '23 16:09 zeynepgulhanuslu

Hi! @zeynepgulhanuslu

I got a robotic voice when the model did not run fast enough and the AudioWorklet received audio samples too late, so it may be due to the web implementation. Currently I have a working demo with a worklet + worker + ring buffer setup, and it works fine on my Mac. You can check the model speed in your implementation. If speed isn't the problem, then it's model quality, but I haven't run into that yet.

Also can you share details about your Web setup?
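For reference, the worklet + worker + ring buffer setup boils down to a single-producer/single-consumer buffer between the inference worker and the audio callback. A minimal pure-Python stand-in (the real thing would be a SharedArrayBuffer-backed buffer in JS; all names here are illustrative):

```python
class RingBuffer:
    """Minimal SPSC ring buffer: the inference worker pushes denoised samples,
    the audio callback pops them. An underrun (worker too slow) yields silence
    instead of blocking, which is exactly what causes audible artifacts."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.read = 0
        self.write = 0
        self.size = 0

    def push(self, samples):
        for s in samples:
            if self.size == self.capacity:
                raise OverflowError("producer too fast")
            self.buf[self.write] = s
            self.write = (self.write + 1) % self.capacity
            self.size += 1

    def pop(self, n):
        out = []
        for _ in range(n):
            if self.size == 0:
                out.append(0.0)  # underrun: fill with silence
            else:
                out.append(self.buf[self.read])
                self.read = (self.read + 1) % self.capacity
                self.size -= 1
        return out

rb = RingBuffer(8)
rb.push([1.0, 2.0, 3.0])
assert rb.pop(2) == [1.0, 2.0]
assert rb.pop(2) == [3.0, 0.0]  # second sample is an underrun
```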

grazder avatar Sep 30 '23 16:09 grazder

I actually attempted to use it within Jitsi-Meet as a replacement for the RNNoise module, and it takes 7 ms for inference on 480 samples. Perhaps I could use VAD (Voice Activity Detection) so that I don't have to run the model all the time. If you're interested, I can send the CPP code via email. Thank you for the detailed answer

zeynepgulhanuslu avatar Oct 02 '23 13:10 zeynepgulhanuslu

Yeah, that would be great. Do you build the wasm module using CPP? 7 ms seems fine, because you have a 10 ms window to run the model.

grazder avatar Oct 02 '23 17:10 grazder

@zeynepgulhanuslu I think I understood the problem. Jitsi, if I'm not mistaken, runs the wasm module inside an AudioWorklet, so there you only have a 128 / sampleRate * 1000 window (here, about 2.7 ms). So 7 ms is too long for it: every time you run inference inside the AudioWorklet's .process function, a lag is created, and because of that delay the sound can become robotic.

As a fix, you can try a different design:

- https://developer.chrome.com/blog/audio-worklet-design-pattern/
- https://github.com/WebAudio/web-audio-api/discussions/2550
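For concreteness, the timing budget above works out like this (pure-Python arithmetic; a 48 kHz sample rate is assumed, as DeepFilterNet uses):

```python
def budget_ms(samples, sample_rate=48000):
    """Time budget per audio callback: samples / sample_rate, in milliseconds."""
    return samples / sample_rate * 1000

worklet_budget = budget_ms(128)  # AudioWorklet render quantum
hop_budget = budget_ms(480)      # DeepFilterNet hop size

print(round(worklet_budget, 1))  # ~2.7 ms: too short for a 7 ms inference
print(round(hop_budget, 1))      # 10.0 ms: enough if inference runs off the audio thread
```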

grazder avatar Oct 05 '23 11:10 grazder

@grazder thank you for your suggestions, I will examine other solutions as well. I thought jitsi-meet collects 128-sample chunks and only sends them to the process function once it has 480. Later, when I tested it in Python code and with wasm, I noticed that it processes some frames in 15 ms. Maybe it's related to this. Do you have any idea what can be done to speed up the model?

```typescript
/**
 * Process an audio frame, optionally denoising the input pcmFrame and returning
 * the Voice Activity Detection score for a raw Float32 PCM sample Array.
 * The size of the array must be of exactly 480 samples, this constraint comes
 * from the rnnoise library.
 *
 * @param {Float32Array} pcmFrame - Array containing 32 bit PCM samples.
 *     Parameter is also used as output when {@code shouldDenoise} is true.
 * @param {boolean} shouldDenoise - Should the denoised frame be returned in pcmFrame.
 * @returns {Float} Contains VAD score in the interval 0 - 1 i.e. 0.90 .
 */
processAudioFrame(pcmFrame: Float32Array, shouldDenoise: Boolean = false): number {
    // Convert 32 bit Float PCM samples to 16 bit Float PCM samples
    // as that's what rnnoise accepts as input
    for (let i = 0; i < RNNOISE_SAMPLE_LENGTH; i++) {
        this._wasmInterface.HEAPF32[this._wasmPcmInputF32Index + i] = pcmFrame[i] * SHIFT_16_BIT_NR;
    }

    // Use the same buffer for input/output, rnnoise supports this behavior
    const vadScore = this._wasmInterface._rnnoise_process_frame(
        this._context,
        this._wasmPcmInput,
        this._wasmPcmInput
    );

    // Rnnoise denoises the frame by default but we can avoid unnecessary operations
    // if the calling client doesn't use the denoised frame.
    if (shouldDenoise) {
        // Convert back to 32 bit PCM
        for (let i = 0; i < RNNOISE_SAMPLE_LENGTH; i++) {
            pcmFrame[i] = this._wasmInterface.HEAPF32[this._wasmPcmInputF32Index + i] / SHIFT_16_BIT_NR;
        }
    }

    return vadScore;
}
```

zeynepgulhanuslu avatar Oct 05 '23 11:10 zeynepgulhanuslu

I think you can remove the normalization steps for the model (in your cpp code). I don't do them and it works fine. Also, you can profile your solution in the browser; I saw some lags caused by the garbage collector.

Also, I use wasm-simd for ONNX inference and it works faster than plain wasm ONNX (I don't build the wasm module myself right now). About speeding up the model: I've already mentioned post-training static quantization and various graph optimizations. Right now I don't have other ideas; maybe I'll try some things if it becomes necessary for me.
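To illustrate what post-training quantization buys on the checkpoint-size side, here is a minimal symmetric int8 weight-quantization sketch in pure Python. This is illustrative only: real torch/ONNX quantization also rewrites the compute ops, and the function names here are made up.

```python
def quantize_int8(weights):
    """Map float weights to int8 codes plus a single scale (symmetric scheme).
    Storage drops from 4 bytes (float32) to 1 byte per weight."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights; error is bounded by scale / 2."""
    return [c * scale for c in codes]

w = [0.5, -1.27, 0.02]
codes, scale = quantize_int8(w)
w_approx = dequantize(codes, scale)
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(w, w_approx))
```

The quantization error per weight is at most half the scale, which is why layers with small dynamic range quantize well while outlier-heavy layers may need per-channel scales.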

grazder avatar Oct 05 '23 14:10 grazder

@Penguin168 The Offline model processes audio completely - we put the full audio spectrogram into the model. The Streaming model processes audio frame by frame. Only the streaming model can be exported to ONNX. And it can be used in real-time applications where the offline model cannot be used in this way. If we process one audio with both models separately, the result will be the same.

Thanks for the excellent job! I tested the PR branch ("Git commit: 5224ed4, branch: torchDF-changes") in both offline and streaming modes. The quality of the voice processed by the offline version was consistent with the official testing procedure, while the audio-quality loss in streaming mode was more severe (starting from about 6 s; 0-6 s was OK). I don't know what caused it. Could you help me? Thank you!

Offline infer log:

```
python3 torch_df_offline.py --input-folder ./data/in/ --output-folder ./data/out/
/workspace/aec/DeepFilterNet/DeepFilterNet/df/io.py:9: UserWarning: torchaudio.backend.common.AudioMetaData has been moved to torchaudio.AudioMetaData. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
2023-10-18 18:02:43 | INFO | DF | Running on torch 2.1.0+cu121
2023-10-18 18:02:43 | INFO | DF | Running on host hugo-home
2023-10-18 18:02:43 | INFO | DF | Git commit: 5224ed4, branch: torchDF-changes
2023-10-18 18:02:43 | INFO | DF | Loading model settings of DeepFilterNet3
2023-10-18 18:02:43 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
2023-10-18 18:02:43 | INFO | DF | Initializing model deepfilternet3
2023-10-18 18:02:43 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2023-10-18 18:02:43 | INFO | DF | Running on device cuda:0
2023-10-18 18:02:43 | INFO | DF | Model loaded
Reading audio from folder - ./data/in/
Found 1 audio in ./data/in/
Inferencing model to folder - ./data/out/...
```

Streaming infer log:

```
python3 torch_df_streaming.py --audio-path ../linein_aec2.wav --output-path data/out/denoised_test_streaming22.wav
/workspace/aec/DeepFilterNet/DeepFilterNet/df/io.py:9: UserWarning: torchaudio.backend.common.AudioMetaData has been moved to torchaudio.AudioMetaData. Please update the import path.
  from torchaudio.backend.common import AudioMetaData
2023-10-18 17:49:47 | INFO | DF | Running on torch 2.1.0+cu121
2023-10-18 17:49:47 | INFO | DF | Running on host hugo-home
2023-10-18 17:49:47 | INFO | DF | Git commit: 5224ed4, branch: torchDF-changes
2023-10-18 17:49:47 | INFO | DF | Loading model settings of DeepFilterNet3
2023-10-18 17:49:47 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
2023-10-18 17:49:47 | INFO | DF | Initializing model deepfilternet3
2023-10-18 17:49:47 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2023-10-18 17:49:47 | INFO | DF | Running on device cuda:0
2023-10-18 17:49:47 | INFO | DF | Model loaded
```

walletiger avatar Oct 18 '23 10:10 walletiger

I haven't tested it on phones, but I'm going to try soon. I think that changing hop_size could reduce quality. I've thought about trying quantization to reduce the RTF and checkpoint size, but I haven't tried it yet.

@grazder Thank you for your reply. As far as I know, torch quantization is not very friendly. How is the progress of your model quantization ?

GreatDarrenSun avatar Oct 23 '23 14:10 GreatDarrenSun

@GreatDarrenSun I'm not currently working on this because the current latency suits my current needs. But maybe someday later I'll start

grazder avatar Oct 25 '23 07:10 grazder

Following up on my earlier report about the streaming-mode quality loss starting at about 6 s:

It sounds slightly bad when some states are not updated while the input is silence. Continuing to update the states fixed this.

walletiger avatar Oct 25 '23 10:10 walletiger

Hello! I wrote a complete torch implementation without using rust or tract, and I'm thinking about a PR. So I decided to ask: would you be interested in accepting this contribution? It may help people understand the model, because the rust/tract code is currently difficult for the average torch user to follow. Also, my streaming implementation can be fully exported into a single ONNX model, which is useful, for example, for web inference. I've written tests that show equivalence with the current code (or with a not-very-old commit).

Hello, can you share the streaming implementation with me? I'm very interested, but I don't know how to do it. Thank you very much! My email is: [email protected]

308627993 avatar Oct 31 '23 07:10 308627993

@308627993

You can find it here - https://github.com/Rikorose/DeepFilterNet/blob/5224ed47e19b6b327c9f22df6c5a061e3c3f6d5f/torchDF/torch_df_streaming.py

Here is PR - https://github.com/Rikorose/DeepFilterNet/pull/433/

grazder avatar Oct 31 '23 08:10 grazder

hi, thanks a lot for your help!


308627993 avatar Oct 31 '23 15:10 308627993

@grazder Hey! Thank you for your work on this. It's really cool to see the torch reimplementation running!

I wanted to get your input on what I think is a bug with the streaming model.

If you listen to this collection of samples there is some light chop in the streaming version that does not appear in the offline or normal DF3 output. I think you can hear it best near the end.

Do you have any thoughts on what might be causing it?

ZacharyYarostTechsmith avatar Nov 02 '23 20:11 ZacharyYarostTechsmith

Well, you can try inference with the apply_all_stages=True parameter (in init); then you'll get output like the offline model's. I bet the problem is there. That parameter exists to reach a full match with the rust implementation.

grazder avatar Nov 02 '23 20:11 grazder

Well, you can try inference with the apply_all_stages=True parameter (in init); then you'll get output like the offline model's. I bet the problem is there. That parameter exists to reach a full match with the rust implementation.

I appreciate the reply! I just gave that a try but the output still retained the choppiness.

Does anything else come to mind that might be at play?

ZacharyYarostTechsmith avatar Nov 02 '23 21:11 ZacharyYarostTechsmith

Can you send an audio please? I'll try to debug it tomorrow

grazder avatar Nov 02 '23 22:11 grazder

Can you send an audio please? I'll try to debug it tomorrow

Of course. Thanks for looking into it!

I've attached the original sample and the two denoised versions from my video.

AudioSamples.zip

ZacharyYarostTechsmith avatar Nov 02 '23 23:11 ZacharyYarostTechsmith