wstd-eq Plugin consumes even more CPU when idle

Hi Wasted Audio Team,

I've encountered a strange issue when using WSTD EQ on REAPER for Linux. If the plugin is processing audio, CPU usage is below 1.0% on average. However, when I click "Stop" on REAPER, CPU usage will terribly increase to 7.0%.

See the following screenshots:

On processing:

On idle:

My system environment:

PC: ThinkPad R400
CPU: Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz
OS: Arch Linux
DAW: REAPER v6.83
WSTD EQ version: v1.0 (official release)

Jan 03 '24 13:01 AnClark

This is a Linux perf stat when testing with REAPER (VST3 edition). I stayed for a little long time on idle.

perf.data.tar.gz.

Here are some screenshots:

Jan 03 '24 13:01 AnClark

Hey @AnClark thank you for the detailed report.

This will take some dedicated time to figure out. I haven't seen this issue before and I can't directly think of what could cause it.

It might be related to the framework we use. Have you used any other DPF based plugins before that show a similar load increase when transport is stopped?

Jan 03 '24 18:01 dromer

Hmm, so from your perf inspection it seems that the biquad filters take a lot of time on your machine

I'm quickly trying this with the v1.0 release in REAPER on my AMD Ryzen 5 (quite a bit more performant than your ancient C2D). And I don't see any such discrepancies:

Idle:

Active:

Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again. Not sure how else I could reproduce.

I am considering to enable SSE4.1 for all plugins this year, which should give a near 4x performance increase. This instruction set is supported for C2D at least. Maybe we can do some preliminary tests to see if this improves this a bit for you.

Jan 03 '24 19:01 dromer

Btw it seems your perf.data file is incompatible with my system, so I cannot read the output myself.

I'm guessing those visual stats are also an extra feature of that version, which I don't seem to have.

Jan 03 '24 19:01 dromer

Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again.

Not sure how else I could reproduce.

Here's another way you can reproduce the issue:

Add a new track, and load WSTD EQ;
Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;
Switch on repeat (activate the "Toggle Repeat" button);
Play.

WSTD EQ still consumes more CPU when I stopped playing.

It's strange that during processing, biquad filter works perfectly. Only if I stopped transport, the filter begins to consume CPU.

Is it possible that any inappropriate samples were being processed by DSP, which made it misbehave?

Jan 04 '24 13:01 AnClark

It might be related to the framework we use. Have you used any other DPF based plugins before that show a similar load increase when transport is stopped?

Yes. I'm porting some LV2 plugins to DPF. Both of the following plugins used to have similar issue:

Both of them have a Moog-style filter. If I stopped transport, the filter will increase the CPU load tremendously. Currently I didn't figure out why it happens, so I just made a workaround: bypass filters if oscillators does not send samples to them.

Jan 04 '24 13:01 AnClark

Here's another way you can reproduce the issue:

1. Add a new track, and load WSTD EQ;

2. Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;

3. Switch on repeat (activate the "Toggle Repeat" button);

4. Play.

I tried following these instructions. I have a midi section, JS: White noise Generator, then VST3: WSTD EQ. playing this selection on repeat and playing or not playing it doesn't get beyond 0.03%

Jan 04 '24 21:01 dromer

I’ve checked __hv_biquad_f() in generated code.

Seems that newer CPU like your Ryzen 5 enabled solution(s) optimized with AVX or SSE 4.1, while my ancient C2D only supports SSE and SSE2, so it fallbacks to this simple solution:

  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;

However it's still strange: this solution performs quite well on transport, but CPU load increases when transport stops.

Full `__hv_biquad_f()` code in WSTD EQ

#if _WIN32 && !_WIN64
void __hv_biquad_f_win32(SignalBiquad *o, hv_bInf_t *_bIn, hv_bInf_t *_bX0, hv_bInf_t *_bX1, hv_bInf_t *_bX2, hv_bInf_t *_bY1, hv_bInf_t *_bY2, hv_bOutf_t bOut) {
  hv_bInf_t bIn = *_bIn;
  hv_bInf_t bX0 = *_bX0;
  hv_bInf_t bX1 = *_bX1;
  hv_bInf_t bX2 = *_bX2;
  hv_bInf_t bY1 = *_bY1;
  hv_bInf_t bY2 = *_bY2;
#else
void __hv_biquad_f(SignalBiquad *o, hv_bInf_t bIn, hv_bInf_t bX0, hv_bInf_t bX1, hv_bInf_t bX2, hv_bInf_t bY1, hv_bInf_t bY2, hv_bOutf_t bOut) {
#endif
#if HV_SIMD_AVX
  __m256 x = _mm256_permute_ps(bIn, _MM_SHUFFLE(2,1,0,3));  // [3 0 1 2 7 4 5 6]
  __m256 y = _mm256_permute_ps(o->x, _MM_SHUFFLE(2,1,0,3)); // [d a b c h e f g]
  __m256 n = _mm256_permute2f128_ps(y,x,0x21);              // [h e f g 3 0 1 2]
  __m256 xm1 = _mm256_blend_ps(x, n, 0x11);                 // [h 0 1 2 3 4 5 6]

  x = _mm256_permute_ps(bIn, _MM_SHUFFLE(1,0,3,2));  // [2 3 0 1 6 7 4 5]
  y = _mm256_permute_ps(o->x, _MM_SHUFFLE(1,0,3,2)); // [c d a b g h e f]
  n = _mm256_permute2f128_ps(y,x,0x21);              // [g h e f 2 3 0 1]
  __m256 xm2 = _mm256_blend_ps(x, n, 0x33);          // [g h 0 1 2 3 4 5]

  __m256 a = _mm256_mul_ps(bIn, bX0);
  __m256 b = _mm256_mul_ps(xm1, bX1);
  __m256 c = _mm256_mul_ps(xm2, bX2);
  __m256 d = _mm256_add_ps(a, b);
  __m256 e = _mm256_add_ps(c, d); // bIn*bX0 + o->x1*bX1 + o->x2*bX2

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];
  float y4 = e[4] - y3*bY1[4] - y2*bY2[4];
  float y5 = e[5] - y4*bY1[5] - y3*bY2[5];
  float y6 = e[6] - y5*bY1[6] - y4*bY2[6];
  float y7 = e[7] - y6*bY1[7] - y5*bY2[7];

  o->x = bIn;
  o->ym1 = y7;
  o->ym2 = y6;

  *bOut = _mm256_set_ps(y7, y6, y5, y4, y3, y2, y1, y0);
#elif HV_SIMD_SSE
  __m128 n = _mm_blend_ps(o->x, bIn, 0x7); // [a b c d] [e f g h] = [e f g d]
  __m128 xm1 = _mm_shuffle_ps(n, n, _MM_SHUFFLE(2,1,0,3)); // [d e f g]
  __m128 xm2 = _mm_shuffle_ps(o->x, bIn, _MM_SHUFFLE(1,0,3,2)); // [c d e f]

  __m128 a = _mm_mul_ps(bIn, bX0);
  __m128 b = _mm_mul_ps(xm1, bX1);
  __m128 c = _mm_mul_ps(xm2, bX2);
  __m128 d = _mm_add_ps(a, b);
  __m128 e = _mm_add_ps(c, d);

  const float *const bbe = (float *) &e;
  const float *const bbY1 = (float *) &bY1;
  const float *const bbY2 = (float *) &bY2;

  float y0 = bbe[0] - o->ym1*bbY1[0] - o->ym2*bbY2[0];
  float y1 = bbe[1] - y0*bbY1[1] - o->ym1*bbY2[1];
  float y2 = bbe[2] - y1*bbY1[2] - y0*bbY2[2];
  float y3 = bbe[3] - y2*bbY1[3] - y1*bbY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = _mm_set_ps(y3, y2, y1, y0);
#elif HV_SIMD_NEON
  float32x4_t xm1 = vextq_f32(o->x, bIn, 3);
  float32x4_t xm2 = vextq_f32(o->x, bIn, 2);

  float32x4_t a = vmulq_f32(bIn, bX0);
  float32x4_t b = vmulq_f32(xm1, bX1);
  float32x4_t c = vmulq_f32(xm2, bX2);
  float32x4_t d = vaddq_f32(a, b);
  float32x4_t e = vaddq_f32(c, d);

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = (float32x4_t) {y0, y1, y2, y3};
#else
  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;
#endif
}

Jan 04 '24 23:01 AnClark

As I said we do not build with SIMD optimizations yet (only on ARM).

Your CPU should support SSE4.1 which I might enable later this year. C2D is about 15 years old now.

You could try this optimization by adding -msse41 to the CXXFLAGS in the plugin/source/Makefile.

Jan 05 '24 06:01 dromer

I have a newer ThinkPad X201 Tablet. It has a Core 1st Gen processor (Core i7 L 640).

I enabled -msse41, and tested again. Even though SIMD instructions reduced CPU usages by 1.0% on idle, the problem still exists.

Sounds like we have something to do with the algorithm.

Jan 05 '24 09:01 AnClark

For reference, here's a Moog-style filter from RaffoSynth, which has the same problem as I described:

//hace lo mismo que la versión en asm
void equalizer(float* buffer, float* prev_vals, uint32_t sample_count, float psuma0, float psuma2, float psuma3, float ssuma0, float ssuma1, float ssuma2, float ssuma3, float factorSuma2){
    float psuma1 = psuma0 *2;
  for (int i = 0; i < sample_count; i++) {
    //low-pass filter    

    float temp = buffer[i];
    buffer[i] *= psuma0; 	//psuma0 == factorsuma1
    buffer[i] += psuma0 * prev_vals[0] + psuma1 * prev_vals[1] 
                    + psuma2 * prev_vals[2] + psuma3* prev_vals[3];
    prev_vals[0] = prev_vals[1];
    prev_vals[1] = temp;
    
    // peaking EQ (resonance)
    float temp2 = buffer[i];

    buffer[i] *= factorSuma2;
    buffer[i] += ssuma0 * prev_vals[2] + ssuma1 * prev_vals[3] 
                    + ssuma2 * prev_vals[4] + ssuma3 * prev_vals[5];
    prev_vals[2] = prev_vals[3];
    prev_vals[3] = temp;
    prev_vals[4] = prev_vals[5];
    prev_vals[5] = buffer[i];
 	}
}

Jan 05 '24 12:01 AnClark

I got a hint from FalkTX on what could be going on. Can you perhaps try the following?

To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add

#include "extra/ScopedDenormalDisable.hpp"

And in the run function set the following:

  const ScopedDenormalDisable sdd;
  const TimePosition& timePos(getTimePosition());

Rebuild and try again.

Jan 05 '24 19:01 dromer

@dromer OK. I'll try tonight (BJT), and give you report.

Jan 06 '24 02:01 AnClark

@AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

Jan 06 '24 10:01 dromer

I got a hint from FalkTX on what could be going on. Can you perhaps try the following?

To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add
#include "extra/ScopedDenormalDisable.hpp"
And in the run function set the following:
  const ScopedDenormalDisable sdd;
  const TimePosition& timePos(getTimePosition());
Rebuild and try again.

Great! By adding those lines, and build with -O3 CXX flag, problem resolved. Now CPU usage is about 0.6% on idle.

Jan 06 '24 14:01 AnClark

@AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

I've also tested your build.

Your build has better performance than mine. CPU usage is not beyond 0.5% on idle. So disabling denormal numbers really works.

Jan 06 '24 14:01 AnClark

Cool! thank you for confirming. I guess on older systems as yours this really makes a difference. On my machines I couldn't spot any significant change.

Now comes the question on how to best apply this, as setting this option can potentially break things as well ..

Jan 06 '24 14:01 dromer

My pleasure!

It would be better if there were any document for ScopedDenormalDisable. It's the first time I know this API. I wonder if it's proved stable by FalkTX and contributors.

Also you can do more tests on other platforms, including Apple Silicon. All of my machines are not newer than Core i5 5th-Gen.

Jan 06 '24 14:01 AnClark

I do not own any Windows or MacOS machines, so doing "proper" testing on those is not possible. What I generally do is pass builds to friends and ask them to report if it works :shrug:

Jan 06 '24 14:01 dromer

Btw the only documentation for this class is in the code: https://github.com/DISTRHO/DPF/blob/main/distrho/extra/ScopedDenormalDisable.hpp

Jan 06 '24 15:01 dromer

I've found a solution: add a new entry in HVCC JSON metadata (e.g. dpf.enable_denormal_number_fix or other better name), to control whether to enable this fix or not. So we can only apply this fix on WSTD EQ, and let other products uneffected.

What's more, we can also provide 2 builds of WSTD EQ since next release. One applys this fix, and the other one keeps as-is.

Jan 06 '24 23:01 AnClark

I don't see any reason to provide two completely separate builds of the same plugin, that doesn't make any sense. Either such a patch will be in place, or it won't.

Having it as a configurable option in the json is a nice idea, so it won't be put there automatically for all DPF builds. I'd like to know more about the implications of the patch and how it could disrupt plugin and host behavior before moving forward with a permanent solution.

Jan 06 '24 23:01 dromer

Maybe I can help test on Windows (as well as Wine). I have a Hewlett-Packard Pavillion with Windows 11 and Msys2 installed (though it uses i7-5500U).

What's more, if WSTD and HVCC had unit test (or benchmark test) it would also help a lot.

Jan 06 '24 23:01 AnClark

HVCC does have some testing in place (although not everything works), but that's a discussion for a different project :)

Jan 06 '24 23:01 dromer

So how could we do tests? Maybe we can make a roadmap for testing plugins (maybe not limited to WSTD EQ). For example, specify test cases and target DAWs.

Jan 07 '24 01:01 AnClark