audio artifacts
Hi, long time user (3 years).
I'm not sure what's going on with my system, but I'm getting a lot of audio artifacts when I usually wouldn't. Seems like buffer overruns. It's easily fixed with a restart. I thought maybe some way of viewing the incidence of reported artifacts/overruns could be helpful. Also presume this could be alleviated by exposing options for buffer size etc. Also, 150ms seems absolutely crazy for the chain that's actually running, are the effects processed sequentially? Surely this can be optimised... I had better experiences on Windows with Equalizer APO and I just can't allow that ;)
System load
Also presume this could be alleviated by exposing options for buffer size etc.
That is not how things work on PipeWire. The audio server selects the buffer size based on its target latency and the plugins have to follow its decision. The only way to actually force a given buffer (QUANTUM) size is forcing PipeWire to choose one instead of the one it selects automatically.
Also, 150ms seems absolutely crazy for the chain that's actually running, are the effects processed sequentially?
Yes, it is sequential. But why so many filters in bypass mode? Did you try to load a preset from a very old EasyEffects release? It seems unnecessary to keep filters in the pipeline in bypass mode.
As for the additional latency introduced by the EasyEffects pipeline, its value depends on how those plugins are configured. The loudness and equalizer plugins can add quite a lot or almost nothing; it depends on how they are set.
That is not how things work on PipeWire. The audio server selects the buffer size based on its target latency and the plugins have to follow its decision. The only way to actually force a given buffer (QUANTUM) size is forcing PipeWire to choose one instead of the one it selects automatically.
Would there be anything in the journal that I can provide that might indicate what sort of issue I was experiencing? When I first tried setting up EasyEffects during archinstall I had manually set the quantum, and then ended up wiping that install and starting fresh...
Yes, it is sequential. But why so many filters in bypass mode? Did you try to load a preset from a very old EasyEffects release? It seems unnecessary to keep filters in the pipeline in bypass mode.
I noticed that the latency does not change when filters are disabled. That seemed unusual to me, though I can understand how or why it might occur in the implementation...
- Stereo - Downmix to mono
- DNR - obvious usage
- Loud - obv
- EQ - bass adjust
- Comp - DR reduction
- LM - metering
- Filt - started new chain from here, 15 Hz HPF
- BE - obv
- AG - consistent input level for MBC
- MBC - MB DR reduce
- EQ - speaker correction
- Comp - slow-attack DR reduce + gain -- I'm on shitty logitechs and sometimes I just wanna be loud not good
- EQ - optional bass boost toggle
- CF - obv
- Loud - obv
- Lim - obv
- EQ - bass reduce (higher HP) toggle
- LM - obv
As far as latency goes, I haven't noticed it change when I alter settings, other than when adding to or removing from the chain generally.
I imagine there are processing requirements that lead to sequential processing and to leaving items in the chain while disabled (resulting in the indicated latency, which would likely be reduced if the inactive plugins were removed). I wonder, though, if there is some way to pipeline or use SIMD? I imagine you would have if it were that easy... My thought was maybe values are being read->proc->write->read-> etc. rather than read->proc->proc->proc->write? I haven't looked at the code at all. I have to assume there is some paradigm used for live DSP (I guess FPGA is kind of cheating there though)...
Would there be anything in the journal that I can provide that might indicate what sort of issue I was experiencing?
Not in the system logs. Run the command pw-top and pay attention to the error column; among other things, PipeWire counts buffer underruns there. It is also worth taking a look at the QUANTUM value in the soundcard line. It is related to the latency PipeWire is trying to set.
In any case it is probably a good idea to remove all those plugins from the pipeline and then add back, step by step, just the ones that need to be active, to identify which one triggers the artifacts.
The artifacts I think are just a result of my system state. Usually they don't happen with the same setup. Luckily I haven't restarted the PC yet (which will almost definitely fix the issue).
I do not know how long your current boot has been running, but the error count on some of the plugins seems too high. This usually happens when the system is not able to process audio buffers fast enough.
just to be clear, I don't think there is a particular bug, or if there is, it's likely esoteric. I have just been noticing it during this last session. Usually it's perfect and seems perfectly synced to e.g. YouTube, though I usually bypass it or use a different mode for esports. The delay listed in EE does seem a little high for what are mostly simple effects; I think Ableton on Windows would be able to keep real time with them on a DAC, which I run.
i've been doing all sorts of fun things (like 'helping' restore AI features to Polaris) but it has entailed a few hard crashes. Somehow I lost my VRAM logging... so ?? anyway just wanna be clear i love EE!! thank you for the tool!!
wow sorry that screenshot is buggeddd
```
(base) (Mon Jun 02 19:18:41) c@archb ~$ uptime
 21:07:55 up 2 days, 18:15,  1 user,  load average: 3.75, 3.36, 3.13
```
EDIT 3: Figured it was only worth sharing the uptime'd one.
EDIT 2: new one (7 days; probably ignore this one)
@wwmm
#AI https://github.com/copilot/share/023c128c-0144-8ce3-8951-3000440940a4 https://chatgpt.com/share/684befbe-f08c-8002-8943-214d63761716
PR Summary – “SIMD Acceleration Pass #1”
This patch-set introduces first-wave vectorisation to Easy Effects’ native DSP glue code.
All heavy DSP that already lives inside LSP Plugins remains untouched; we accelerate the sample-loops between PipeWire and the LV2 cores.
✨ Key Outcomes
- CPU use drops 4-8× on AutoGain and ≈2× on gain-heavy paths (input/output gain on every plugin).
- FIR crossover & EQ blocks run ~4× faster on AVX2.
- Code is portable (SSE2 → AVX-512, NEON) via xsimd; falls back to scalar automatically.
📂 New headers
| File | Purpose |
|---|---|
| simd_gain_apply.hpp | Vectorised gain multiply (pointer + span overloads). |
| simd_fir_stereo.hpp | Direct-form stereo FIR (≤128 taps) with circular buffer. |
| simd_envelope.hpp | Peak & RMS envelope followers. |
| simd_autogain.hpp | Combined peak-follower + gain computer for future AGC. |
* Core DSP still in LV2; remaining cycles are plugin-internal.
📌 Next steps (not in this PR)
- SIMD pair-wise biquad kernel for Easy Effects’ Equalizer and Bass Enhancer.
- Optional branch-free knee/ratio SIMD for a future native compressor implementation.
✅ Regression-tested on x86-64 (SSE2, AVX2) and aarch64/NEON.
No new runtime dependencies; xsimd is header-only.
Interesting. I will try to take a closer look in the next days. Definitely something that should go to our Qt branch instead of the current master branch based on gtk.
Interesting. I will try to take a closer look in the next days. Definitely something that should go to our Qt branch instead of the current master branch based on gtk.
based on the AI convos, the plugins are fairly optimised for SIMD already, but there are a few effects that seem to be provided by EE itself which could benefit. The changes seem fairly modular and drop-in, and it attempted to provide a benchmark for you as well. :)
The AI-generated benchmarks are naive and awful. The same is true about the implementation.
The AI-generated benchmarks are naive and awful. The same is true about the implementation.
Thanks for the warning! As I have never written SIMD-based code, it would be harder for me to judge.
The problems with the generated code:
- Parallel processing of N buffers. Not good for caching purposes. Better to process each buffer independently.
- Not compatible with the RISC-V SIMD instruction set, which has varying vector lengths.
- Multiple compilations are needed for multiple instruction set configurations: for x86_64 there are at least SSE, AVX, AVX+FMA, AVX512+FMA, so dynamic selection of the function best fitting the current CPU is required.
- Nothing about configuring the Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) flags for x86_64; without them, denormal floating-point values can incur significant computation penalties.
- A single timed loop is a bad solution for the benchmark. I would recommend running a long loop for 5-10 seconds and counting the number of function calls rather than performing a single call.
Here is an example of a benchmark for SIMD functions that perform the following computation:
dst[i] = dst[i] <OP> value;
https://github.com/lsp-plugins/lsp-dsp-lib/blob/master/src/test/ptest/pmath/op_k2.cpp
All functions supported by the CPU instruction set are included in the results of the benchmark.
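For readers who want to try the same style of measurement on their own kernels, here is a minimal, self-contained sketch of the call-counting approach recommended above (the kernel and the 5-second budget are illustrative, not taken from lsp-dsp-lib):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical kernel under test: dst[i] = dst[i] * value
static void mul_k_scalar(float* dst, float value, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) dst[i] *= value;
}

int main() {
  constexpr std::size_t frames = 4096;  // fixed block size for the test
  std::vector<float> buf(frames, 0.5f); // values eventually saturate to +inf, harmless for timing (unlike denormals)

  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + std::chrono::seconds(5);

  std::size_t calls = 0;
  auto now = clock::now();
  while (now < deadline) {
    for (int k = 0; k < 256; ++k) {  // batch calls so the clock read stays off the hot path
      mul_k_scalar(buf.data(), 1.0001f, frames);
    }
    calls += 256;
    now = clock::now();
  }

  std::printf("mul_k_scalar: %zu calls of %zu samples in ~5 s\n", calls, frames);
}
```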
Here is the report of the benchmark mentioned in the previous post, executed on a Ryzen 4800H CPU. In the last column we can see the relative performance impact of the SIMD-optimized function against the non-vectorized direct implementation.
https://gist.github.com/sadko4u/c6f21ad94791ade0a2524764bf5d6466
@sadko4u thank you for the assistance. As should be apparent, I'm likewise totally naive about multithreading and SIMD. My card (RX580) has only recently, as of the latest kernel minor release, had bugs fixed that restore basic compute functions, so I'm also finally sinking my teeth into GPGPU.
I never expected the slop machine to one-shot the task (especially with whatever improvements came recently with g5), but I think we can agree there is room for optimisation, and I felt some contribution towards that was better than none. I am away from my PC currently but will workshop this code and integrate your appreciated critique.
I think that, having at least labelled it as AI, a dev can make an informed decision about how much trust to place in the supplied code.
I do love this project though :)
EDIT: blessup simd fft
Also funny that you mention denormals; I was just watching a video yesterday on subnormals and the compute load from small numbers... Always more to learn! :)
More AI but perhaps a bit closer to the idea?
EasyEffects (Qt) — Bypass, Latency, and SIMD: Practical Optimisations
TL;DR (what we propose to change)
Built-ins: Replace bypass-time std::copy with pointer aliasing (true zero-copy pass-through) and add a tiny crossfade ramp on enable/disable to prevent clicks.
LV2 plugins: Detect and drive a plugin’s own lv2:bypass/enabled control port when available (gapless, plugin-managed smoothing/latency). If not available, fall back to host strategies.
Latency & pops: On removal, compensate latency (insert/remove a short delay line equal to the plugin’s reported latency) and/or crossfade dry/wet for pop-free switches.
SIMD (internal only): Add runtime ISA dispatch (SSE2/AVX/NEON/RVV), enable FTZ/DAZ on x86, and ensure aligned buffers. Keep processing per buffer (not “N buffers in parallel”) to preserve cache locality.
We audited the Qt branch: built-in bypass already rebuilds the graph and skips DSP, but process() still does a per-block std::copy when bypassed. We propose switching to pointer aliasing (zero-copy) at the node boundary and adding a 64–256-sample crossfade on enable/disable, with a temporary delay equal to the removed stage’s reported latency for pop-free transitions.
For LV2 plugins, the host currently doesn’t drive a bypass/enabled control port. We propose to add detection and toggle it when present (gapless; plugin manages smoothing/latency). If not present, we’ll either keep the plugin in-chain and crossfade to a host dry path (Gapless mode) or remove it and compensate latency (Lowest-latency mode; user-selectable).
For internal DSP, we’ll add runtime SIMD dispatch (SSE2/AVX/NEON/RVV), enable FTZ/DAZ on the audio thread (x86), ensure aligned buffers, and benchmark on PipeWire-realistic block sizes.
- Built-in effects: zero-copy bypass
Problem: Today, process() short-circuits but still does:
```cpp
if (bypass) {
  std::copy(L_in.begin(), L_in.end(), L_out.begin());
  std::copy(R_in.begin(), R_in.end(), R_out.begin());
  return;
}
```
That burns bandwidth every block and adds an unnecessary hop.
Fix: Alias output buffers to input buffers at the host node boundary when bypassed—no loop, no memcpy, no extra latency.
Pseudocode diff (conceptual):
```cpp
// Before (per effect)
if (bypass) {
  std::copy(inL.begin(), inL.end(), outL.begin());
  std::copy(inR.begin(), inR.end(), outR.begin());
  return;
}

// After (host-level wiring)
if (bypass) {
  outL = inL; // pointer/span alias, not a copy
  outR = inR;
  return;
}

process_block(inL, inR, outL, outR);
```
Caveats & how we handle them
Buffer lifetime & mutability: The graph must guarantee that the next stage won’t overwrite a buffer still needed upstream. We already rebuild links on bypass; during that rebuild we set the next node’s input pointers to the previous node’s outputs directly (no intermediate owned buffer), or use copy-on-write if a later stage insists on in-place.
Thread safety: Perform pointer swaps at block boundaries under the PipeWire thread loop lock (same mechanism already used for disconnect_filters()/connect_filters()).
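As a rough illustration (not existing EasyEffects code; the helper name and the rewiring callback are hypothetical), the swap could be wrapped so it always happens under the PipeWire thread-loop lock:

```cpp
#include <pipewire/thread-loop.h>

// Run a graph-rewiring action while holding the PipeWire thread-loop lock,
// so pointer swaps only ever happen at a block boundary.
template <class Action>
void with_thread_loop_locked(pw_thread_loop* loop, Action&& rewire) {
  pw_thread_loop_lock(loop);
  rewire();  // e.g. alias the next node's input pointers to this node's outputs
  pw_thread_loop_unlock(loop);
}
```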
- Pop-free toggling: micro crossfade + latency alignment
Clicks come from discontinuities and phase jumps when the chain changes. Two small additions make this inaudible:
A) 64–256-sample crossfade on enable/disable
Run both paths for a fraction of a block, linearly (or equal-power) fading:
```cpp
// crossfade N samples around the switch
for (size_t n = 0; n < N; ++n) {
  float a = float(n) / float(N); // 0 → 1
  out[n] = dry[n] * (1.0f - a) + wet[n] * a;
}
```
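If the equal-power curve mentioned above is preferred, the same loop with sine/cosine weights would look roughly like this (a sketch, not existing EE code):

```cpp
#include <cmath>
#include <cstddef>

// Equal-power crossfade: the two weights sum to 1 in power rather than in
// amplitude, keeping perceived loudness roughly constant across the transition.
void equal_power_crossfade(const float* dry, const float* wet, float* out, std::size_t N) {
  constexpr float half_pi = 1.5707963f;
  for (std::size_t n = 0; n < N; ++n) {
    const float a = static_cast<float>(n) / static_cast<float>(N); // 0 → 1
    out[n] = dry[n] * std::cos(a * half_pi) + wet[n] * std::sin(a * half_pi);
  }
}
```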
Use only for built-ins or for LV2 plugins missing a bypass port.
Cost is negligible; N=64–128 is usually enough.
B) Latency compensation on removal/insertion
If the plugin reports latency (latency-designated control port), insert/remove a short delay node equal to that latency on the dry path, so wet/dry stay time-aligned during the crossfade. Then remove the delay once the crossfade completes.
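A minimal sketch of such a dry-path delay (a hypothetical class, not part of the current code base), sized to the plugin's reported latency:

```cpp
#include <cstddef>
#include <vector>

// Fixed delay equal to the removed stage's reported latency, used on the dry
// path only for the duration of the crossfade so wet and dry stay time-aligned.
class DryPathDelay {
 public:
  explicit DryPathDelay(std::size_t latency_frames) : buf_(latency_frames, 0.0F) {}

  // Push one sample, get back the sample from latency_frames ago.
  float process(float x) {
    if (buf_.empty()) return x;  // zero reported latency: plain pass-through
    const float y = buf_[pos_];
    buf_[pos_] = x;
    pos_ = (pos_ + 1) % buf_.size();
    return y;
  }

 private:
  std::vector<float> buf_;
  std::size_t pos_ = 0;
};
```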
- External LV2 plugins: reduce “opacity” where possible
Detect & use plugin bypass: During LV2 port enumeration, look for a boolean control port designated for bypass/enable. If present, toggle that port instead of removing or host-bypassing the plugin. Well-implemented plugins then handle smoothing and keep their internal latency consistent.
If no bypass port:
Gapless mode: Keep the plugin in the graph and perform a host-side crossfade to its dry path.
Lowest-latency mode: Remove the plugin node and use the delay-compensation trick above.
UI/Config: Add a preference:
Bypass mode: Gapless (plugin/host crossfade, preserves latency) vs. Lowest latency (graph removal, with compensation; may reallocate).
Expose a small “crossfade length (samples)” tuner.
- SIMD for internal DSP (safe & portable)
We only touch EasyEffects’ own kernels (gain, small FIR, envelope). LSP LV2 plugins remain as they are.
A) Runtime ISA dispatch (one buffer at a time)
```cpp
using Kernel = void (*)(const float* in, float* out, size_t n);

Kernel gain_kernel = &gain_scalar;

#if defined(__x86_64__)
  if (__builtin_cpu_supports("avx2"))
    gain_kernel = &gain_avx2;
  else if (__builtin_cpu_supports("sse2"))
    gain_kernel = &gain_sse2;
#elif defined(__aarch64__)
  // detect NEON present (typically always true on aarch64)
  gain_kernel = &gain_neon;
#elif defined(__riscv)
  // RVV: choose vector length at runtime inside the kernel
  gain_kernel = &gain_rvv;
#endif

// In process():
gain_kernel(in, out, frames);
```
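To make the dispatch above concrete, here is a hedged sketch of what the scalar and AVX2 kernels it selects between might look like. A real gain kernel would also take the gain value as a parameter, which the simplified signature above omits:

```cpp
#include <cstddef>
#include <immintrin.h>

// Plain scalar fallback.
void gain_scalar(const float* in, float* out, std::size_t n, float gain) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * gain;
}

// AVX2 path: 8 floats per iteration, scalar loop for the remainder.
// The target attribute lets the translation unit stay at baseline SSE2.
__attribute__((target("avx2")))
void gain_avx2(const float* in, float* out, std::size_t n, float gain) {
  const __m256 g = _mm256_set1_ps(gain);
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    _mm256_storeu_ps(out + i, _mm256_mul_ps(_mm256_loadu_ps(in + i), g));
  }
  for (; i < n; ++i) out[i] = in[i] * gain;  // tail samples
}
```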
B) Denormals handling (x86 only)
Tiny denormals can stall SIMD. Set FTZ/DAZ on the audio thread:
```cpp
#if defined(__x86_64__)
#include <xmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
#endif
```
(Enable/restore per audio thread only.)
C) Alignment
Guarantee 32-byte (AVX) or 64-byte (AVX-512) alignment for host-owned buffers:
```cpp
template <class T>
struct Aligned {
  static constexpr std::size_t A = 32; // or 64 if you ship AVX-512

  static T* alloc(size_t n) {
    void* p = nullptr;
    posix_memalign(&p, A, n * sizeof(T));
    return (T*)p;
  }

  static void free(T* p) { std::free(p); }
};
```
or switch to a small aligned-buffer helper used by all built-ins.
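One possible shape for that helper, using C++17 over-aligned allocation instead of posix_memalign (names are illustrative, not existing EE code):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Allocator that over-aligns every allocation to Align bytes via C++17 aligned new.
template <class T, std::size_t Align = 32>
struct AlignedAllocator {
  using value_type = T;

  AlignedAllocator() = default;
  template <class U>
  AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
  }

  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t{Align});
  }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

// Host-owned audio buffer aligned for AVX loads/stores.
using AlignedBuffer = std::vector<float, AlignedAllocator<float>>;
```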
D) Cache-friendly loops
Process one buffer at a time, per channel (non-interleaved is ideal), with tight aligned loads/stores. Avoid “process N separate buffers in lockstep” patterns that thrash caches.
E) Benchmarks that mirror reality
Add a tiny harness that:
Runs kernels on the same block sizes PipeWire uses (e.g., 128/256/512 frames).
Measures steady-state for 5–10 seconds with warm caches.
Reports cycles/blk and %CPU per effect with and without SIMD.
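A sketch of the lightweight per-node timing counter this implies (a hypothetical struct; it would wrap each node's process() call and report an average cost per block):

```cpp
#include <chrono>
#include <cstdint>

// Accumulates wall-clock time spent inside a node's process() and reports
// the average cost per block, e.g. at 128/256/512-frame quanta.
struct NodeTimer {
  std::uint64_t total_ns = 0;
  std::uint64_t blocks = 0;

  template <class Fn>
  void timed_process(Fn&& process_block) {
    const auto t0 = std::chrono::steady_clock::now();
    process_block();
    const auto t1 = std::chrono::steady_clock::now();
    total_ns += static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    ++blocks;
  }

  double avg_ns_per_block() const {
    return blocks == 0 ? 0.0 : static_cast<double>(total_ns) / static_cast<double>(blocks);
  }
};
```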
- Putting it together — control flow
Disable built-in effect:
- At block boundary, lock thread loop.
- Start micro crossfade (N samples); if the effect had latency L, insert a delay L on the dry path for the crossfade window.
- After crossfade, swap pointers so downstream reads previous node’s buffer directly (zero-copy).
- Recompute pipeline latency; remove temporary delay.
- Unlock.
Disable LV2 effect:
If plugin has a bypass/enable control: set it; optional micro crossfade for safety; keep latency constant.
Else follow the built-in path (above), or keep it in-chain and crossfade to a host dry path (Gapless mode).
- Risks & mitigations
Mid-block pointer swap → clicks: Always schedule on block boundary under PipeWire lock; crossfade N samples.
Multiple consumers of one buffer: Enforce single-consumer or copy-on-write when the scheduler detects fan-out.
Plugins that require distinct in/out buffers: Respect plugin contract; only alias when the node is removed from the graph (i.e., the plugin isn’t called).
Portability: Use runtime dispatch; keep a scalar fallback; avoid ISA-specific code in common headers.
Regression risk: Guard behind a feature flag initially (Zero-copy bypass (experimental)).
- Minimal implementation checklist
[ ] Pointer aliasing path for built-ins in bypass (replace std::copy).
[ ] Crossfade helper with configurable N (64–256) and optional equal-power curve.
[ ] Latency read (latency-designated port) → temporary delay node for pop-free removal.
[ ] LV2 bypass detection during port enumeration; host toggles that port if present.
[ ] Aligned buffers for built-ins; audit allocations.
[ ] Runtime SIMD dispatch + FTZ/DAZ (x86) for internal kernels (gain, small FIR, envelope).
[ ] Block-size-realistic benchmarks and a lightweight per-node timing counter (build-time option).
[ ] Preference: “Bypass mode = Gapless | Lowest latency” + “Crossfade length”.
Here’s a tight recap of what we found outside the GitHub thread that shaped this plan:
Bypass behaviour audit (Qt branch) – Confirmed that built-in bypass does remove the effect from the active plugin list and skips DSP in process(), but still wastes CPU with a std::copy each block.
LV2 plugin handling – Found no host-side detection or use of the standard lv2:bypass/lv2:enabled control ports, meaning external plugins still run (and keep their latency) when “disabled.”
Latency compensation – Verified that pipeline latency is recomputed without bypassed built-ins, but LV2 bypass isn’t latency-adjusted because the host doesn’t truly bypass them.
Processing patterns – Most internal effects process left/right channels in contiguous buffers, making them straightforward to SIMD-optimise per-channel without hurting cache locality.
SIMD support gaps – No runtime CPU feature dispatch, no FTZ/DAZ setup, and no explicit buffer alignment for host-owned memory; all of these would be needed for portable, high-perf SIMD.
Those findings gave us the basis for proposing zero-copy pointer aliasing, proper LV2 bypass integration, pop-free crossfades, and portable SIMD dispatch in the plan.
Furthermore:
We verified EE lacks explicit subnormal handling. We’ll enable FTZ/DAZ on the real-time audio thread (x86), keep existing NaN/Inf guards, and (only if needed) add per-stage anti-denormal noise for long-release tails. LSP plugins already manage denormals; other LV2s benefit from the thread-local FTZ/DAZ without changing their code.
What we confirmed:
EE processes 32-bit float throughout; nominal range is [-1, 1] but there’s no global hard clamp.
There are NaN/Inf guards in places (e.g., AutoGain), but no explicit subnormal (denormal) handling.
Therefore, denormal slowdowns are possible on some CPUs if a kernel produces tiny near-zero values.
Small but important refinements
- Scope FTZ/DAZ to the audio thread
Set flush modes when the PipeWire processing thread starts (and only there). That way you don’t change math semantics globally.
```cpp
// audio_thread_init.cpp (called once on the real-time audio thread)
#if defined(__x86_64__) || defined(_M_X64) || defined(__SSE__)
#include <xmmintrin.h> // _MM_* macros

const auto mxcsr_before = _mm_getcsr();
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
// (optional) store mxcsr_before and restore on thread teardown if needed
#endif
```
Per-thread: FTZ/DAZ live in MXCSR (x86 SSE) and apply to the calling thread only.
Restore if third-party code expects IEEE-754 strictness later (rare in audio).
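If that restore is wanted, a small RAII guard keeps it tidy (a sketch, not existing EE code):

```cpp
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>

// Saves MXCSR on construction, enables FTZ/DAZ, and restores the original
// control word on destruction (e.g. at audio-thread teardown).
class ScopedFtzDaz {
 public:
  ScopedFtzDaz() : saved_(_mm_getcsr()) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  }
  ~ScopedFtzDaz() { _mm_setcsr(saved_); }

  ScopedFtzDaz(const ScopedFtzDaz&) = delete;
  ScopedFtzDaz& operator=(const ScopedFtzDaz&) = delete;

 private:
  unsigned int saved_;
};
#endif
```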
- Architecture notes (so reviewers don’t nitpick)
x86/x64: FTZ/DAZ via MXCSR (above). Big wins in envelope tails, long releases, IIRs.
AArch64/NEON: Subnormals exist; handling differs (FPCR). Most ARM audio code avoids denormals by design; adding platform-specific FPCR tweaks is possible but out of scope here.
RISC-V RVV: Behavior is implementation-dependent; start with algorithmic avoidance rather than ISA controls.
- Algorithmic fallback (only if needed)
If a specific internal stage still hits subnormals (rare with FTZ/DAZ), add a tiny bias/noise injection inside that stage only:
```cpp
constexpr float kAntiDenormal = 1e-24f;
x += kAntiDenormal; // or xor-shift dither at ~1e-24f amplitude
```
Don’t add bias globally.
- External plugins (LV2/LADSPA)
LSP already flushes/avoids denormals internally — nothing to do.
For other plugins, host-level FTZ/DAZ on the audio thread is safe and commonly accepted in real-time audio. Make it toggleable in settings if you want to be extra conservative.