
Should we do `dred_compute_latents` in DTX?

[Open] shuanzhu opened this issue 6 months ago · 9 comments

We found that the average encoding time per frame doubles in silence segments compared with voiced segments when DRED is enabled.

After checking the code, we saw that dred_compute_latents is always called, even in DTX where Opus would send an empty packet, and near-silent input causes a huge increase in the GRU's computation time. To work around this we tried enabling OPUS_FLOAT_APPROX, but it didn't help; we then added _mm_setcsr(csr | 0x8040) before encoding in our own code, and that works.
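For reference, the workaround on our side looks roughly like this (a sketch assuming x86 with SSE; enable_ftz_daz is our own helper name, not part of Opus):

#include <xmmintrin.h>

/* Set flush-to-zero (FTZ, bit 0x8000) and denormals-are-zero (DAZ,
   bit 0x0040) in the SSE control/status register, so denormal inputs
   and results are treated as zero instead of taking the slow
   microcode path. */
static void enable_ftz_daz(void)
{
    unsigned int csr = _mm_getcsr();
    _mm_setcsr(csr | 0x8040);
}

Note this changes float behavior for the whole calling thread, which is why we would prefer a fix inside the encoder itself.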

To save CPU, we would like to ask the experts: should dred_compute_latents still run during DTX? Specifically, could we change the code to the following?

#ifdef ENABLE_DRED
    opus_int32 in_dtx = 0;
    if ( st->dred_duration > 0 && st->dred_encoder.loaded && opus_encoder_ctl(st, OPUS_GET_IN_DTX(&in_dtx)) == OPUS_OK && !in_dtx ) {
        int frame_size_400Hz;
        /* DRED Encoder */
        dred_compute_latents( &st->dred_encoder, &pcm_buf[total_buffer*st->channels], frame_size, total_buffer, st->arch );
        frame_size_400Hz = frame_size*400/st->Fs;
        OPUS_MOVE(&st->activity_mem[frame_size_400Hz], st->activity_mem, 4*DRED_MAX_FRAMES-frame_size_400Hz);
        for (i=0;i<frame_size_400Hz;i++)
           st->activity_mem[i] = activity;
    } else {
        st->dred_encoder.latents_buffer_fill = 0;
        OPUS_CLEAR(st->activity_mem, DRED_MAX_FRAMES);
    }
#endif


shuanzhu commented Jun 20 '25 03:06

So it looks like you're hitting some kind of issue with denormalized float values, which can take much longer to compute than normal values. There are other cases where we hit that in Opus, and typically the solution is to add the VERY_SMALL (1e-30) constant in the right place to avoid the issue. You can see a few examples in celt/celt_decoder.c. So here it would help if you could track down exactly where the issue happens. It may very well be just a single loop causing the problem.
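To illustrate the pattern (a hypothetical filter for illustration, not the actual celt_decoder.c code):

#define VERY_SMALL 1e-30f

/* On silent input, a recursive filter's state decays toward zero and
   ends up in the denormal range, where each multiply can be orders of
   magnitude slower on some x86 CPUs. Adding VERY_SMALL to the input
   keeps the state out of that range without audibly changing the
   output. */
static void leaky_integrator(const float *x, float *y, float *mem, int N)
{
    int i;
    for (i = 0; i < N; i++) {
        y[i] = x[i] + VERY_SMALL + 0.999f*mem[0];
        mem[0] = y[i];
    }
}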

jmvalin commented Jun 20 '25 13:06

Can you try this trivial patch and see if it helps:

diff --git a/dnn/dred_encoder.c b/dnn/dred_encoder.c
index edb49cc2..ffc6d6f8 100644
--- a/dnn/dred_encoder.c
+++ b/dnn/dred_encoder.c
@@ -116,7 +116,7 @@ void filter_df2t(const float *in, float *out, int len, float b0, const float *b,
     for (i=0;i<len;i++) {
         int j;
         float xi, yi, nyi;
-        xi = in[i];
+        xi = in[i] + VERY_SMALL;
         yi = xi*b0 + mem[0];
         nyi = -yi;
         for (j=0;j<order;j++)

We can't just disable dred_compute_latents() during DTX as that would cause a bunch of bad side effects. But just preventing the denorm operations should be sufficient.

jmvalin commented Aug 15 '25 20:08

I tried your patch, but it seems the denormal issue is still there.

The attached picture shows the time consumed in our app in speech segments and silence segments; each line shows the maximum cost in microseconds per 20 ms frame.

[Image: per-frame encode time, speech vs. silence segments]

shuanzhu commented Aug 18 '25 06:08

I'm unable to reproduce. Please provide either:

  1. a profiling trace, including the function/line(s) that are slow, or
  2. an input file and opus_demo command line that reproduces the problem.

Without those, it's hard to track down the problem. Also, what CPU are you using?

jmvalin commented Aug 18 '25 14:08

The input file is "DenormalInput.pcm" in the attached Materials.zip, and "opus_timing.txt" is the per-20 ms timing I print for each opus_encode call; the maximum cost time increases in the silence part. The command is:

opus_demo voip 48000 1 85000 -bandwidth SWB -inbandfec -dec_complexity 0 -loss 100 -lossfile loss_10_10_50.txt -dred 30 DenormalInput.pcm DenormalInput_out.pcm
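For reference, the timings in opus_timing.txt are produced by wrapping each opus_encode call like this (a sketch for Windows; the helper name and log format are our own, not part of opus_demo):

#include <stdio.h>
#include <windows.h>
#include <opus.h>

/* Time a single 20 ms opus_encode() call with QueryPerformanceCounter
   and log the elapsed microseconds and the packet size. */
static void encode_and_log(OpusEncoder *enc, const opus_int16 *pcm,
                           int frame_size, FILE *log)
{
    unsigned char packet[1500];
    LARGE_INTEGER freq, t0, t1;
    opus_int32 len;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    len = opus_encode(enc, pcm, frame_size, packet, sizeof(packet));
    QueryPerformanceCounter(&t1);

    fprintf(log, "%lld us (%d bytes)\n",
            (long long)((t1.QuadPart - t0.QuadPart) * 1000000 / freq.QuadPart),
            (int)len);
}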

I built the Opus project with: cmake -S. -Bbuild -A x64 -G "Visual Studio 17 2022" -DOPUS_DRED=ON -DOPUS_BUILD_PROGRAMS=ON -DBUILD_TESTING=ON

I'm using a Windows x64 machine; the details are below:

[Image: CPU and system details]

Materials.zip

shuanzhu commented Aug 19 '25 06:08

Sorry, I still cannot reproduce; for me everything runs normally. At this point, the best option would be for you to use a profiler (like perf on Linux) and report which function (and ideally the approximate line number) is taking a long time to run. Alternatively, you could try adding VERY_SMALL to a bunch of signals and see which one gets things back to normal.
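For example, assuming a Linux build with perf available, something like this would show where the time goes:

perf record -g ./opus_demo voip 48000 1 85000 -bandwidth SWB -inbandfec -dec_complexity 0 -loss 100 -lossfile loss_10_10_50.txt -dred 30 DenormalInput.pcm DenormalInput_out.pcm
perf report --stdio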

jmvalin commented Aug 19 '25 15:08

I was able to reproduce on an Intel chip and checked in a fix for what I saw. See if that helps.

jmvalin commented Nov 19 '25 22:11

Could you clarify which branch and which specific commits you are referring to?

shuanzhu commented Nov 20 '25 02:11

Commit 6b151b8be on the main branch: "Fixes a denormal issue in DRED encoding"

jmvalin commented Nov 20 '25 19:11