FIRBenchmarks

New promising benchmark using Fastor C++

ghost opened this issue 3 years ago • 5 comments

Hi. I've added a new class with a naive implementation of the inner product using Fastor C++ (https://github.com/romeric/Fastor). Fastor is header-only, so it's easy to include. It's a very good framework that uses static compile-time tensors to do linear algebra (and more). In the case of an FIR filter, where the kernel size can be known at compile time, we can use it like this:

FastorFIR.h

#ifndef FASTORFIR_H_INCLUDED
#define FASTORFIR_H_INCLUDED

#include <algorithm> // std::copy

#include "Fastor/Fastor.h"
#include "BaseFilter.h"

template <size_t N>
class FastorFIR : public BaseFilter
{
public:
    FastorFIR()
    {}

    String getName() const override { return "FastorFIR"; }

    void prepare(double /*sampleRate*/, int /*samplesPerBlock*/) override
    {
        zPtr = 0; // reset state pointer
        z.zeros(); // clear existing state
    }

    void loadIR(const AudioBuffer<float>& irBuffer) override
    {
        auto* data = irBuffer.getReadPointer(0);
        std::copy(data, &data[N], h.data());
    }

    void processBlock(AudioBuffer<float>& b) override
    {
        auto* buffer = b.getWritePointer(0);
        const int numSamples = b.getNumSamples();

        for (int n = 0; n < numSamples; ++n)
        {
            // insert input into double-buffered state
            z(zPtr + N) = z(zPtr) = buffer[n];

            // compute inner product
            Fastor::Tensor<float, N> zn = z(Fastor::seq(zPtr, zPtr + N, 1));
            buffer[n] = Fastor::inner(zn, h);

            zPtr = (zPtr == 0 ? N - 1 : zPtr - 1); // iterate state pointer in reverse
        }
    }

protected:
    Fastor::Tensor<float, N> h;

private:
    Fastor::Tensor<float, N * 2> z;
    int zPtr = 0;
};

#endif // FASTORFIR_H_INCLUDED
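For reference, the double-buffered trick the class relies on (each input sample is written twice, so the window `z[zPtr .. zPtr + N)` is always contiguous and never needs wrap-around) can be sketched in plain C++ without Fastor. This is a minimal sketch with illustrative names, not code from the benchmark:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>

// Double-buffered FIR: the state is stored twice, so the read window
// starting at zPtr is always one contiguous run of N samples.
template <std::size_t N>
class DoubleBufferFIR
{
public:
    void loadIR (const float* taps) { std::copy (taps, taps + N, h.begin()); }

    float processSample (float x)
    {
        z[zPtr + N] = z[zPtr] = x;               // write the input twice
        float y = std::inner_product (z.begin() + zPtr,
                                      z.begin() + zPtr + N,
                                      h.begin(), 0.0f);
        zPtr = (zPtr == 0 ? N - 1 : zPtr - 1);   // walk the pointer backwards
        return y;
    }

private:
    std::array<float, N> h {};
    std::array<float, 2 * N> z {};
    std::size_t zPtr = 0;
};
```

The cost of the double write buys a branch-free, modulo-free inner product, which is exactly what makes a contiguous view (or a plain pointer range) usable per sample.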

I needed to modify main.cpp to allow compile-time instantiation over the different kernel sizes:

main.cpp

[...]

int main()
{
    std::unique_ptr<BaseFilter> fastorStaticInstances[12]=
    {
        std::make_unique<FastorFIR<16>>(),
        std::make_unique<FastorFIR<17>>(),
        std::make_unique<FastorFIR<31>>(),
        std::make_unique<FastorFIR<32>>(),
        std::make_unique<FastorFIR<64>>(),
        std::make_unique<FastorFIR<67>>(),
        std::make_unique<FastorFIR<127>>(),
        std::make_unique<FastorFIR<128>>(),
        std::make_unique<FastorFIR<256>>(),
        std::make_unique<FastorFIR<257>>(),
        std::make_unique<FastorFIR<509>>(),
        std::make_unique<FastorFIR<512>>(),
    };

    // if running debug mode, check the accuracy
    // of each FIR processor
#if JUCE_DEBUG
    testAccuracies();
#endif

[...]

    auto n = 0;
    for (int irSize : irSizes)
    {
        std::cout << "Running with IR size: " << irSize << " samples" << std::endl;

        auto irBuffer = createRandomBuffer (rand, irSize);

        std::vector<std::unique_ptr<BaseFilter>> filters;
        filters.push_back (std::make_unique<JuceConvolution>());
        filters.push_back (std::make_unique<JuceFIR>());
        filters.push_back (std::make_unique<InnerProdFIR> (irSize));
        filters.push_back (std::make_unique<SimdFIR> (irSize));
        filters.push_back (std::move(fastorStaticInstances[n]));

[...]

        ++n;

        std::cout << std::endl;
    }

[...]

    // Run the accuracy check for each filter
    std::vector<std::unique_ptr<BaseFilter>> filters;
    filters.push_back (std::make_unique<InnerProdFIR> (irSize));
    filters.push_back (std::make_unique<SimdFIR> (irSize));
    filters.push_back (std::make_unique<FastorFIR<irSize>> ());
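Because `irSize` is a runtime value, the template argument has to be picked from a fixed set of instantiations baked in at compile time. One way to sketch that dispatch (illustrative names, not the benchmark's actual code) is a factory that switches over the compiled-in sizes:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical minimal interface standing in for BaseFilter.
struct Filter
{
    virtual ~Filter() = default;
    virtual std::size_t order() const = 0;
};

// Stand-in for a statically sized FIR such as FastorFIR<N>.
template <std::size_t N>
struct StaticFIR : Filter
{
    std::size_t order() const override { return N; }
};

// Map a runtime size onto one of the compile-time instantiations.
std::unique_ptr<Filter> makeStaticFIR (int irSize)
{
    switch (irSize)
    {
        case 16:  return std::make_unique<StaticFIR<16>>();
        case 32:  return std::make_unique<StaticFIR<32>>();
        case 64:  return std::make_unique<StaticFIR<64>>();
        default:  return nullptr; // size not compiled in
    }
}
```

Any size not listed falls through to `nullptr` (or could fall back to a dynamically sized implementation); the pre-built `fastorStaticInstances` array above is another way of expressing the same fixed-set dispatch.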

Even though Fastor is designed for very good performance on small tensors (say 3x3, 5x5, ...), here the slicing view used to move around the state tensor is a bottleneck at small kernel sizes, and I'm pretty sure that an assignment like that inside a loop is the wrong way to do the job in any case. I'm not very familiar with Fastor, but I can ask Roman, the author of the framework, for a little help.

Dell Workstation
CPU:              Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
Instruction sets: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, EM64T, VT-x

So these are the benchmarks on my (old) computer:

--------------------------------------------------------------------------------
POWER OF 2 BENCHMARKS
--------------------------------------------------------------------------------
             |    16    |    32    |    64    |    128   |    256   |    512   |
--------------------------------------------------------------------------------
     JuceFIR |  11.1443 |  31.8635 |  52.3454 |  92.4750 |  172.238 |  332.482 |
InnerProdFIR |  11.5063 |  19.8507 |  39.0645 |  78.8536 |  158.520 |  327.190 |
     SimdFIR |  13.9615 |  22.7610 |  38.8627 |  73.6769 |  140.121 |  289.668 |
   FastorFIR |  12.5692 |  19.2430 |  33.1792 |  60.0263 |  117.296 |  223.924 |
--------------------------------------------------------------------------------
 Gain o/best |  +1.4249 |  -0.6077 |  -5.6835 | -13.6506 |  -22.825 |  -65.744 |
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
PRIME BENCHMARKS
--------------------------------------------------------------------------------
             |    17    |    31    |    67    |    127   |    257   |    509   |
--------------------------------------------------------------------------------
     JuceFIR |  11.7722 |  30.4538 |  55.2651 |  91.8949 |  172.757 |  330.699 |
InnerProdFIR |  12.2105 |  20.3858 |  41.5768 |  79.1687 |  159.243 |  325.250 |
     SimdFIR |  14.1141 |  22.6684 |  40.3495 |  73.1771 |  140.577 |  285.678 |
   FastorFIR |  14.0457 |  21.5038 |  37.6908 |  66.3931 |  114.819 |  228.626 |
--------------------------------------------------------------------------------
 Gain o/best |  +2.2735 |  +1.1180 |  -2.6587 |  -6.7840 |  -25.758 |  -57.052 |
--------------------------------------------------------------------------------

The full benchmark output is here:

SIMD size: 4
Running with IR size: 16 samples
JuceConv: 82.5784
JuceFIR: 11.1443
InnerProdFIR: 11.5063
SimdFIR: 13.9615
FastorFIR: 12.5692

Running with IR size: 17 samples
JuceConv: 82.5697
JuceFIR: 11.7722
InnerProdFIR: 12.2105
SimdFIR: 14.1141
FastorFIR: 14.0457

Running with IR size: 31 samples
JuceConv: 82.5479
JuceFIR: 30.4538
InnerProdFIR: 20.3858
SimdFIR: 22.6684
FastorFIR: 21.5038

Running with IR size: 32 samples
JuceConv: 82.6034
JuceFIR: 31.8635
InnerProdFIR: 19.8507
SimdFIR: 22.761
FastorFIR: 19.243

Running with IR size: 64 samples
JuceConv: 82.4876
JuceFIR: 52.3454
InnerProdFIR: 39.0645
SimdFIR: 38.8627
FastorFIR: 33.1792

Running with IR size: 67 samples
JuceConv: 82.5303
JuceFIR: 55.2651
InnerProdFIR: 41.5768
SimdFIR: 40.3495
FastorFIR: 37.6908

Running with IR size: 127 samples
JuceConv: 82.6474
JuceFIR: 91.8949
InnerProdFIR: 79.1687
SimdFIR: 73.1771
FastorFIR: 66.3931

Running with IR size: 128 samples
JuceConv: 82.4483
JuceFIR: 92.475
InnerProdFIR: 78.8536
SimdFIR: 73.6769
FastorFIR: 60.0263

Running with IR size: 256 samples
JuceConv: 82.6541
JuceFIR: 172.238
InnerProdFIR: 158.52
SimdFIR: 140.121
FastorFIR: 117.296

Running with IR size: 257 samples
JuceConv: 82.4801
JuceFIR: 172.757
InnerProdFIR: 159.243
SimdFIR: 140.577
FastorFIR: 114.819

Running with IR size: 509 samples
JuceConv: 82.4918
JuceFIR: 330.699
InnerProdFIR: 325.25
SimdFIR: 285.678
FastorFIR: 228.626

Running with IR size: 512 samples
JuceConv: 83.9669
JuceFIR: 332.482
InnerProdFIR: 327.19
SimdFIR: 289.668
FastorFIR: 223.924

I hope this can help find the best FIR engine, because in real-time audio processing every millisecond is precious!

Sincerely, Max

ghost avatar Jun 13 '21 17:06 ghost

Ah, this is very interesting, thanks for sharing! It's definitely good to see the improved performance, especially at FIR sizes of 64 and above. That said, I'm not sure it's a 100% fair comparison, since the filter size is fixed at compile time...

Does Fastor have dynamically resizable tensors, i.e. Fastor::Tensor<float, Fastor::Dynamic> h; (I'm thinking of something similar to what the Eigen library offers)? If not, maybe a good workaround would be to choose a maximum FIR size and use that for the size of each tensor, kind of like this:

#ifndef FASTORFIR_H_INCLUDED
#define FASTORFIR_H_INCLUDED

#include "Fastor/Fastor.h"
#include "BaseFilter.h"

template <size_t MAX_SIZE=4096>
class FastorFIR : public BaseFilter
{
public:
    FastorFIR(int order) : order(order)
    {}

    String getName() const override { return "FastorFIR"; }

    void prepare(double /*sampleRate*/, int /*samplesPerBlock*/) override
    {
        zPtr = 0; // reset state pointer
        z.zeros(); // clear existing state
    }

    void loadIR(const AudioBuffer<float>& irBuffer) override
    {
        auto* data = irBuffer.getReadPointer(0);
        std::copy(data, &data[order], h.data());
    }

    void processBlock(AudioBuffer<float>& b) override
    {
        auto* buffer = b.getWritePointer(0);
        const int numSamples = b.getNumSamples();
        auto hn = z(Fastor::seq(0, order, 1));

        for (int n = 0; n < numSamples; ++n)
        {
            // insert input into double-buffered state
            z(zPtr + order) = z(zPtr) = buffer[n];

            // compute inner product
            auto zn = z(Fastor::seq(zPtr, zPtr + order, 1));
            buffer[n] = Fastor::inner(zn, hn);

            zPtr = (zPtr == 0 ? order - 1 : zPtr - 1); // iterate state pointer in reverse
        }
    }

protected:
    Fastor::Tensor<float, MAX_SIZE> h;

private:
    const int order;
    Fastor::Tensor<float, MAX_SIZE * 2> z;
    int zPtr = 0;
};

#endif // FASTORFIR_H_INCLUDED

jatinchowdhury18 avatar Jun 14 '21 16:06 jatinchowdhury18

Sure, that's not fair (compile-time static size)!!! For the corrected class, there's just a little typo in the hn instantiation (you use z instead of h), but yes, that could do the trick. I have used Fastor for the final backend of the ACME port and the result is very good, but I'm sure it could be better, because I don't think my build was configured very well.

Note that Fastor includes very good SIMD implementations that can be used with or without the main tensor: see simd_vector and simd_math. More specifically, it's possible to use different SIMD backends; quoting the docs: "Optional SIMD backends such as sleef, Vc or even std::experimental::simd". Maybe an alternative to the JUCE or XSIMD ones?

EDIT1: Yes, dynamic resizing is possible, but with constraints on the overall performance.

EDIT2: I've just noticed that auto zn = z(Fastor::seq(zPtr, zPtr + order, 1)); produces a build error at buffer[n] = Fastor::inner(zn, hn);, which is why I suspect my cast is a bottleneck at small kernel sizes. As I understand it, Fastor::seq does not produce a new subtensor but a static view on the current tensor. The problem comes from the input type expected by Fastor::inner. I have opened an issue; maybe I have misunderstood something.

ghost avatar Jun 14 '21 17:06 ghost

Gotcha, that makes sense.

EDIT2: I've just noticed that auto zn = z(Fastor::seq(zPtr, zPtr + order, 1)); produces a build error at buffer[n] = Fastor::inner(zn, hn);, which is why I suspect my cast is a bottleneck at small kernel sizes. As I understand it, Fastor::seq does not produce a new subtensor but a static view on the current tensor. The problem comes from the input type expected by Fastor::inner. I have opened an issue; maybe I have misunderstood something.

This is interesting... I guess Fastor::inner needs to know the input sizes at compile-time?

jatinchowdhury18 avatar Jun 15 '21 15:06 jatinchowdhury18

Yes and no. Yes, because Fastor::inner requires the same type and the same size. No, because the concept is that seq produces a view, not a copy, of the tensor; but when I force the type through the assignment operator, I think it's equivalent to a call to evaluate, and therefore to a copy of the tensor region, hence the bad performance when the kernel size is small.

I think the problem is that the type of the Fastor::inner inputs, written in a single line, is something like Fastor::inner(TensorView, Tensor), and that does not conform to the method signature. That's sad, because Fastor is optimized for small tensor sizes.
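The view-versus-copy distinction being suspected here can be mimicked in plain C++ (illustrative names, not Fastor's API): the "view" version works directly on the original storage, while the "copy" version materialises the window on every call, which is the per-sample cost that a forced evaluation would add:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inner product over a window of a larger state buffer, computed two ways.

float innerNoCopy (const float* state, std::size_t pos,
                   const float* taps, std::size_t n)
{
    // The "view": just a pointer offset into the existing storage.
    return std::inner_product (state + pos, state + pos + n, taps, 0.0f);
}

float innerWithCopy (const float* state, std::size_t pos,
                     const float* taps, std::size_t n)
{
    // Materialising the window first copies n floats on every call,
    // the analogue of assigning a Fastor view to a concrete Tensor.
    std::vector<float> window (state + pos, state + pos + n);
    return std::inner_product (window.begin(), window.end(), taps, 0.0f);
}
```

Both return the same result; only the copy's allocation and memcpy differ, and that fixed overhead dominates exactly when the kernel is small.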

I'm sure a solution exists, maybe even a trick to avoid the loop over the sample buffer, but I'm not an expert. I will investigate this question, because it could be very useful for the Lagrange fractional delay (a very small but intensive FIR)!

ghost avatar Jun 15 '21 19:06 ghost

For the record, I have found a way to call Fastor::inner directly:

buffer[n] = Fastor::inner<float, N>(z(Fastor::seq(p, p + N)), 
                                    h(Fastor::seq(0, N)));

I need to force the template specialization because z has size 2N while h has size N. Strangely, the performance is really bad.

Low-order kernel size example:

Running with IR size: 4 samples
JuceConv: 93.7853
JuceFIR: 5.47821
InnerProdFIR: 8.4689
SimdFIR: 11.325
FastorFIR: 43.8663

Running with IR size: 5 samples
JuceConv: 91.9068
JuceFIR: 5.95146
InnerProdFIR: 8.98747
SimdFIR: 11.8097
FastorFIR: 53.0712

I have also tested a linear-algebra method where z is a single buffer of size N (% is the matrix-vector product in Fastor):

for (int n = 0; n < numSamples; ++n)
{
    auto x = buffer[n];
    // next-state equation
    z = M % z + h * x;
    // output equation
    buffer[n] = z(N-1);
}

where M is an N×N shift matrix with ones on the superdiagonal (used transposed), e.g. 5×5:

0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
0 0 0 0 0

This is elegant, and it works great, but the performance is poor.
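A plain C++ sketch of that recursion, with M applied implicitly as a one-place shift rather than a full N×N multiply, makes the behaviour easy to check. Note that with this (transposed) shift direction the impulse response comes out as the taps in reverse order, so h would have to be loaded reversed; names here are illustrative:

```cpp
#include <array>
#include <cstddef>

// State-space FIR: z <- M z + h * x, y = z[N-1], where M is the N x N
// shift matrix with ones on the subdiagonal, i.e. (M z)[i] = z[i-1].
template <std::size_t N>
struct StateSpaceFIR
{
    std::array<float, N> h {};   // taps (realised time-reversed, see below)
    std::array<float, N> z {};   // state vector

    float processSample (float x)
    {
        // Apply M implicitly: shift the state down one place, top first
        // so each z[i-1] is read before it is overwritten.
        for (std::size_t i = N - 1; i > 0; --i)
            z[i] = z[i - 1] + h[i] * x;
        z[0] = h[0] * x;
        return z[N - 1];         // output equation
    }
};
```

Feeding an impulse through taps {1, 2, 3} yields {3, 2, 1, 0}: the taps in reverse, confirming that the explicit matrix form computes the same convolution as the direct inner product, just with a reversed coefficient vector.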

Finally, I took a look at JUCE's FIR code and adapted a version that uses the algorithm from processSingleSample (a single-length FIFO buffer with a double loop), but with Fastor as the container for the buffer and the coefficients. Surprisingly, I get slightly better performance at low orders, up to kernel size 17:
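The FIFO-plus-double-loop structure described there can be sketched roughly as follows; this is a paraphrase of the idea (a single circular buffer whose dot product is split into two contiguous runs around the wrap point, with no per-tap modulo), not the actual JUCE code:

```cpp
#include <array>
#include <cstddef>

// Circular-buffer FIR: one state buffer of length N; the inner product is
// split into two contiguous loops instead of using (pos + k) % N per tap.
template <std::size_t N>
struct CircularFIR
{
    std::array<float, N> h {};
    std::array<float, N> z {};
    std::size_t pos = 0;

    float processSample (float x)
    {
        z[pos] = x;                                 // newest sample at pos
        float y = 0.0f;
        std::size_t k = 0;
        for (std::size_t i = pos; i < N; ++i, ++k)  // run up to the end...
            y += h[k] * z[i];
        for (std::size_t i = 0; i < pos; ++i, ++k)  // ...then wrap to the front
            y += h[k] * z[i];
        pos = (pos == 0 ? N - 1 : pos - 1);
        return y;
    }
};
```

Compared with the double-buffered version, this halves the state memory and the writes per sample, at the cost of splitting the dot product into two loops, which plausibly explains why it wins only at small kernel sizes.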

SIMD size: 4
Running with IR size: 4 samples
JuceConv: 83.4352
JuceFIR: 5.53499
InnerProdFIR: 5.61854
SimdFIR: 8.88822
FastorFIR: 5.48788

Running with IR size: 5 samples
JuceConv: 83.3225
JuceFIR: 5.9902
InnerProdFIR: 6.54317
SimdFIR: 9.24106
FastorFIR: 5.93053

Running with IR size: 6 samples
JuceConv: 83.2822
JuceFIR: 7.5819
InnerProdFIR: 11.5463
SimdFIR: 12.8178
FastorFIR: 6.42618

Running with IR size: 7 samples
JuceConv: 83.1906
JuceFIR: 7.24127
InnerProdFIR: 8.82602
SimdFIR: 10.9537
FastorFIR: 6.97789

Running with IR size: 8 samples
JuceConv: 83.2182
JuceFIR: 7.51527
InnerProdFIR: 7.21745
SimdFIR: 11.2855
FastorFIR: 7.39188

Running with IR size: 9 samples
JuceConv: 83.3385
JuceFIR: 7.94522
InnerProdFIR: 10.873
SimdFIR: 11.8924
FastorFIR: 7.80866

Running with IR size: 10 samples
JuceConv: 83.2699
JuceFIR: 9.05214
InnerProdFIR: 8.84311
SimdFIR: 11.0407
FastorFIR: 8.28191

Running with IR size: 11 samples
JuceConv: 83.6409
JuceFIR: 10.5159
InnerProdFIR: 9.70853
SimdFIR: 12.5718
FastorFIR: 8.81647

Running with IR size: 16 samples
JuceConv: 83.2255
JuceFIR: 11.0937
InnerProdFIR: 11.5952
SimdFIR: 14.0986
FastorFIR: 10.9147

Running with IR size: 17 samples
JuceConv: 83.2542
JuceFIR: 12.728
InnerProdFIR: 12.4616
SimdFIR: 14.2279
FastorFIR: 11.5099

In parallel, I'm currently working on a new version of the DWG (digital waveguide) framework, which is buggy (I've tested with a recent version of VS, and some of the code no longer works). The Lagrange interpolator/deinterpolator is wrong, and I've found a more powerful way to do the processing, with more elegant, modern C++ code.

That's all from me on the Fastor question.

Sincerely, Max

ghost avatar Jun 25 '21 15:06 ghost