sof icon indicating copy to clipboard operation
sof copied to clipboard

Math: Optimise 16-bit matrix multiplication functions.

Open ShriramShastry opened this issue 1 year ago • 18 comments

Improve mat_multiply and mat_multiply_elementwise for 16-bit signed integers by refactoring operations and simplifying handling of Q0 data.

Changes:

  • Replace int64_t with int32_t for accumulators in mat_multiply and mat_multiply_elementwise, reducing cycle count by ~51.18% for elementwise operations and by ~8.18% for matrix multiplication.

  • Enhance pointer arithmetic within loops for better readability and compiler optimization opportunities.

  • Eliminate unnecessary conditionals by directly handling Q0 data in the algorithm's core logic.

  • Update fractional bit shift and rounding logic for more accurate fixed-point calculations.

Performance gains from these optimizations include a 1.08% reduction in memory usage for elementwise functions and a 36.31% reduction for matrix multiplication. The changes facilitate significant resource management improvements in constrained environments.

Summary:

image

ShriramShastry avatar Apr 29 '24 13:04 ShriramShastry

This looks fine to me but @singalsu needs to review.

Long term @ShriramShastry @singalsu thinking aloud, I think it may make more sense to have Kconfig that will use the HiFi3/4/5 kernels from nnlib for xtensa targets i.e. https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

Is the Cadence's license compatible with all SOF usages (firmware, plugin)?

singalsu avatar May 02 '24 06:05 singalsu

Thank you for your review, and I have made the necessary changes. Please read through and make a suggestion.

Thanks, measurements are good, but I think it would be good to understand which changes improve performance

lyakh avatar May 10 '24 06:05 lyakh

Thank you for your review, and I have made the necessary changes. Please read through and make a suggestion.

Thanks, measurements are good, but I think it would be good to understand which changes improve performance

Apologies for not responding to the question you asked earlier as well. I'm trying to address them here.

Key Points of Improvement (1)Accumulator Data Size Reduction

Original Code: int64_t s; PR GitHub Current Code: int32_t acc; Improvement Explanation: Using a 32-bit accumulator (int32_t rather than int64_t) reduces memory ( number of multipliers i.e. 1/2 for current PR GitHub code) usage per operation and improve processing speed, particularly on 32-bit systems where 64-bit arithmetic is more expensive.

(2) Streamlined Loop Structure and Conditional Logic Original Code (mat_multiply):

if (shift_minus_one == -1) {
    for (...) { // Inner loop
        s += (int32_t)(*x) * (*y);
    }
    *z = (int16_t)s;
}

PR GitHub Current Code (mat_multiply):

acc = 0;
for (..., ..., ...) { // Inner loop
    acc += (int32_t)(*x++) * (*y);
    y += y_inc;
}
*z = (int16_t)(((acc >> shift) + 1) >> 1);

Improvement Explanation: The current PR GitHub code seamlessly integrates the special handling of fractional bits within the main loop, eliminating the need for separate conditionals for each case. This minimises branching and increases loop efficiency.

(3)Optimized Memory Access Patterns

Original Code:Memory access patterns were less optimised due to additional variables and possibly unoptimized pointer arithmetic.

x = a->data + a->columns * i;
y = b->data + j;

Improvement Explanation: More effective memory access and cache utilisation are consequently achieved by the PR GitHub current code by streamlining pointer arithmetic and lowering the number of temporary variables.

Unified Handling of Fractional Adjustments

While fractional bits adjustments are handled by both codes, the PR GitHub current implementation handles them more effectively by handling them directly within the computational loop, which lowers overhead.

int64_t s;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        s = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            s += (int32_t)(*x) * (*y);
            x++;
            y += y_inc;
        }
        *z = (int16_t)(((s >> shift_minus_one) + 1) >> 1);
        z++;
    }
}

PR GitHub Current Code (Snippet for mat_multiply):

int32_t acc;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        acc = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            acc += (int32_t)(*x++) * (*y);
            y += y_inc;
        }
        *z = (int16_t)(((acc >> shift) + 1) >> 1);
        z++;
    }
}

Summary for mat_multiply()

PR GitHub code makes use of smaller data sizes, loop structures, streamlined conditional logic, and optimised memory access patterns to minimise memory usage and speed up matrix multiplication operations.

Key Points of Improvement for mat_multiply_elementwise

Reduction in Accumulator Bit Width for Intermediate Results: Original Code: int64_t p; PR GitHub Current Code: int32_t prod;

Improvement Explanation: By eliminating the need to handle larger data types, using a 32-bit integer (int32_t) for intermediate multiplication results instead of a 64-bit integer (int64_t) can speed up the process on platforms where 32-bit instructions are more efficient.

Streamlining the Conditional Logic for Handling Q0 (shift == -1): Original Code:

if (shift_minus_one == -1) {
    *z = *x * *y;
}

PR GitHub Current Code:

if (shift == -1) {
    *z = *x * *y;
}

Improvement Explanation: There is a special case for situations in which bit shift is not needed in both the original and PR GitHub current codes. Since this check is factored out of the loop, the current code performs it more effectively by using a single comparison rather than repeating it for every iteration.

Original Code (Snippet for mat_multiply_elementwise):

int64_t p;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

// Handle no shift case (Q0 data) separately

if (shift_minus_one == -1) {
    for (i = 0; i < a->rows * a->columns; i++) {
        *z = *x * *y;
        x++; y++; z++;
    }
    return 0;
}

// Default case with shift

for (i = 0; i < a->rows * a->columns; i++) {
    p = (int32_t)(*x) * *y;
    *z = (int16_t)(((p >> shift_minus_one) + 1) >> 1);
    x++; y++; z++;
}

PR GitHub Current Code (Snippet for mat_multiply_elementwise):

int32_t prod;
const int shift = a->fractions + b->fractions - c->fractions - 1;

Perform multiplication with or without adjusting the fractional bits

for (int i = 0; i < a->rows * a->columns; i++) {
    if (shift == -1) {
        // Direct multiplication when no adjustment for fractional bits is needed
        *z = *x * *y;
    } else {
        // Multiply elements as int32_t and adjust with rounding
        prod = (int32_t)(*x) * (*y);
        *z = (int16_t)(((prod >> shift) + 1) >> 1);
    }
    x++; y++; z++;
}

Summary for mat_multiply_elementwise() The reduction of the accumulator bit size and the more effective handling of the no-shift case (Q0 data) are the two main optimisations in the elementwise multiplication function. The current code speeds up multiplication operations by reducing the accumulator size. The existing code also refactors how it handles the special case where shifting is not required, which lessen branching and increase the efficiency of the loop.

I would be happy to add if you think that a breakdown of the instructions would be more helpful in a very specific aspects of the code changes between two codes.

ShriramShastry avatar May 10 '24 08:05 ShriramShastry

SOFCI

ShriramShastry avatar May 10 '24 16:05 ShriramShastry

@ShriramShastry thanks for the detailed explanation. Briefly to your points (I removed some explanations for brevity):

Key Points of Improvement (1)Accumulator Data Size Reduction

Ok, this is a clear benefit, as long as there are no drawbacks. But if those accumulators are only used for addition and not too often, I wouldn't expect the performance gain to be huge, but it's still good to use provably correct types for each task.

(2) Streamlined Loop Structure and Conditional Logic Original Code (mat_multiply):

if (shift_minus_one == -1) {
    for (...) { // Inner loop
        s += (int32_t)(*x) * (*y);
    }
    *z = (int16_t)s;
}

PR GitHub Current Code (mat_multiply):

acc = 0;
for (..., ..., ...) { // Inner loop
    acc += (int32_t)(*x++) * (*y);
    y += y_inc;
}
*z = (int16_t)(((acc >> shift) + 1) >> 1);

This kind of changes isn't at all obvious for me why it should improve performance. Maybe important bits got left out in this abbreviated example, but from this it looks like one would need to look at the assembly output to understand where improvements come from.

(3)Optimized Memory Access Patterns

Original Code:Memory access patterns were less optimised due to additional variables and possibly unoptimized pointer arithmetic.

x = a->data + a->columns * i;
y = b->data + j;

Improvement Explanation: More effective memory access and cache utilisation are consequently achieved by the PR GitHub current code by streamlining pointer arithmetic and lowering the number of temporary variables.

Sorry, I don't understand this at all. You say "I've optimised cache utilisation" but you aren't saying where and how.

Unified Handling of Fractional Adjustments

While fractional bits adjustments are handled by both codes, the PR GitHub current implementation handles them more effectively by handling them directly within the computational loop, which lowers overhead.

int64_t s;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        s = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            s += (int32_t)(*x) * (*y);
            x++;
            y += y_inc;
        }
        *z = (int16_t)(((s >> shift_minus_one) + 1) >> 1);
        z++;
    }
}

PR GitHub Current Code (Snippet for mat_multiply):

int32_t acc;
for (i = 0; i < a->rows; i++) {
    for (j = 0; j < b->columns; j++) {
        acc = 0;
        x = a->data + a->columns * i;
        y = b->data + j;
        for (k = 0; k < b->rows; k++) {
            acc += (int32_t)(*x++) * (*y);
            y += y_inc;
        }
        *z = (int16_t)(((acc >> shift) + 1) >> 1);
        z++;
    }
}

The only obvious difference I can see above is the replacement of a 64-bit variable with a 32-bit one. If that's indeed the only improvement, then let's just modify that single line of code and keep the rest.

Streamlining the Conditional Logic for Handling Q0 (shift == -1): Original Code:

if (shift_minus_one == -1) {
    *z = *x * *y;
}

PR GitHub Current Code:

if (shift == -1) {
    *z = *x * *y;
}

I see a pure name replacement above, nothing else. Has nothing to do with performance.

Original Code (Snippet for mat_multiply_elementwise):

int64_t p;
const int shift_minus_one = a->fractions + b->fractions - c->fractions - 1;

// Handle no shift case (Q0 data) separately

if (shift_minus_one == -1) {
    for (i = 0; i < a->rows * a->columns; i++) {
        *z = *x * *y;
        x++; y++; z++;
    }
    return 0;
}

// Default case with shift

for (i = 0; i < a->rows * a->columns; i++) {
    p = (int32_t)(*x) * *y;
    *z = (int16_t)(((p >> shift_minus_one) + 1) >> 1);
    x++; y++; z++;
}

PR GitHub Current Code (Snippet for mat_multiply_elementwise):

int32_t prod;
const int shift = a->fractions + b->fractions - c->fractions - 1;

Perform multiplication with or without adjusting the fractional bits

for (int i = 0; i < a->rows * a->columns; i++) {
    if (shift == -1) {
        // Direct multiplication when no adjustment for fractional bits is needed
        *z = *x * *y;
    } else {
        // Multiply elements as int32_t and adjust with rounding
        prod = (int32_t)(*x) * (*y);
        *z = (int16_t)(((prod >> shift) + 1) >> 1);
    }
    x++; y++; z++;
}

Actually this version should be less optimal because now it has an if inside the loop. Unless something important has been left out in these snippets of course.

lyakh avatar May 13 '24 06:05 lyakh

Actually this version should be less optimal because now it has an if inside the loop. Unless something important has been left out in these snippets of course.

This kind of changes isn't at all obvious for me why it should improve performance. Maybe important bits got left out in this abbreviated example, but from this it looks like one would need to look at the assembly output to understand where improvements come from

Moving the conditional outside the loop reduces computational overhead, simplifies loop logic, and improves efficiency. Similar to fat reduction, this preventive measure shows performance bottlenecks and optimisation strategy.

Sorry, I don't understand this at all. You say "I've optimised cache utilisation" but you aren't saying where and how.

To ensure that pointer arithmetic matches cache line size, memory access optimisations include efficiently reusing the acc variable and accessing matrix data linearly. This reduces cache misses and improves spatial locality. By directly accumulating results, {acc} reduces write-backs to memory, unlike temporary variables.

The only obvious difference I can see above is the replacement of a 64-bit variable with a 32-bit one. If that's indeed the only improvement, then let's just modify that single line of code and keep the rest.

By replacing a 64-bit variable with a 32-bit one and optimising computation within the loop for fractional changes. We simplify fractional bit handling, resulting in lower cycle counts per iteration, especially with faster bit-shifting and addition on 32-bit integers.

I see a pure name replacement above, nothing else. Has nothing to do with performance.

The change from shift_minus_one to shift is part of a larger effort to improve code readability and maintainability. While it's true that this specific renaming does not directly enhance performance, the benefit lies in making the code's intention clearer and the behavior more predictable. By ensuring the variable names accurately reflect their purpose, we reduce the risk of future errors

Actually this version should be less optimal because now it has an if inside the loop. Unless something important has been left out in these snippets of course.

Thank you for bringing this to my attention. I would like to clarify the structure of the modified code and rectify my previous statement regarding the conditional logic for handling shift == -1:

In the PR GitHub current code, the if condition checking whether shift is equal to -1 is indeed placed outside of the iterations loop, which ensures that the check is performed exactly once before entering the loop. This avoids the potential performance penalty associated with repeatedly evaluating the condition within the loop in the original code:

/* Perform multiplication with or without adjusting the fractional bits */
if (shift == -1) {
    /* Direct multiplication when no adjustment for fractional bits is needed */
    for (int i = 0; i < total_elements; i++, x++, y++, z++)
        *z = *x * *y;
} else {
    /* Multiplication with rounding to account for the fractional bits */
    for (int i = 0; i < total_elements; i++, x++, y++, z++) {
        /* Multiply elements as int32_t */
        prod = (int32_t)(*x) * (*y);
        /* Adjust and round the result */
        *z = (int16_t)(((prod >> shift) + 1) >> 1);
    }
}

This structure ensures that the decision regarding which type of multiplication to apply (with or without fractional adjustment) is taken just once, leading to two separate loops that either directly multiply the elements or perform a multiplication that accounts for the fractional bits. This is more efficient than the alternative where the condition check would occur within the loop at every iteration, which would introduce unnecessary branching.

The intent was to streamline the operation by providing two distinct paths, minimizing branching within our loop, and ensuring we execute only the necessary code for a given situation without repeatedly checking the condition. I apologize for the confusion caused by the error in my previous explanation.

This pattern is particularly beneficial in scenarios where the computation inside the loop is substantial, and the condition is invariant across all iterations, so I consider it an optimization over the original implementation.

Loop Counter Scope

for (int i = 0; i < total_elements; i++, x++, y++, z++)

Declaring loop counters within the for loop, when possible, reduces their scope to their necessary locations, enhancing readability and potentially aiding the compiler in optimization.

Constant Expressions

const int total_elements = a->rows * a->columns;
const int shift = a->fractions + b->fractions - c->fractions - 1;

The declaration of loop counters within the for loop aids the compiler's optimisation by limiting scope to the required locations.

Thank you again for your detailed feedback. I look forward to making the necessary adjustments and improvements to the pull request based on our discussion.

ShriramShastry avatar May 14 '24 07:05 ShriramShastry

This looks fine to me but @singalsu needs to review. Long term @ShriramShastry @singalsu thinking aloud, I think it may make more sense to have Kconfig that will use the HiFi3/4/5 kernels from nnlib for xtensa targets i.e. https://github.com/foss-xtensa/nnlib-hifi5/blob/master/xa_nnlib/algo/kernels/matXvec/hifi5/xa_nn_matXvec_16x16.c

Is the Cadence's license compatible with all SOF usages (firmware, plugin)?

Yes, its permissive.

lgirdwood avatar May 14 '24 13:05 lgirdwood

@lyakh @ShriramShastry I think we need to trust the simulator here and that the code is being better optimized and auto vectorized for newer HiFi. i.e. lets not spend time deep diving the assembly. One other thing I would do is to build a pipeline that uses this API and show MCPS before and after. @singalsu which pipeline would you suggest as a good test for above code ?

lgirdwood avatar May 14 '24 13:05 lgirdwood

@singalsu which pipeline would you suggest as a good test for above code ?

I need to make a topology with MFCC, it's the best way to check Sriram's recent optimizations. It should be quite straightforward so I should have it quite soon. A practical topology could use DMIC1 16 kHz device for MFCC so it looks like the best way to start. Also currently there is no other use for DMIC1 device.

singalsu avatar May 14 '24 15:05 singalsu

I ran with testbench some comparison with git main and this patch.

$XTENSA_PATH/xt-run ../../testbench/build_xt_testbench/testbench -q -r 16000 -R 16000 -c 1 -n 1 -b S16_LE -t ../../build_tools/test/topology/test-playback-ssp5-mclk-0-I2S-mfcc-s16le-s16le-48k-24576k-codec.tplg -i in.raw -o mfcc.raw

TGL 96.99 MCPS, with patch 96.74 MCPS, saving 0.25 MCPS MTL 62.15 MCPS, with patch 62.04 MCPS, saving 0.11 MCPS

I'll see if I can load this to a real IPC4 system and see with performance counters the impact. Currently there is no topology for IPC4 so testing of MFCC is very limited.

singalsu avatar Jun 07 '24 14:06 singalsu

I just tested this with MFCC in TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes uneven load, so this time using peak as more representative perf measure instead of average.

singalsu avatar Aug 13 '24 16:08 singalsu

I just tested this with MFCC in TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes uneven load, so this time using peak as more representative perf measure instead of average.

@singalsu do you have more tests to run or good to go with the optimization after cleanups are resolved ?

lgirdwood avatar Aug 14 '24 13:08 lgirdwood

I just tested this with MFCC in TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes uneven load, so this time using peak as more representative perf measure instead of average.

@singalsu do you have more tests to run or good to go with the optimization after cleanups are resolved ?

I think this is good to merge after suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC these savings are quite significant. So no need for further performance tests.

singalsu avatar Aug 14 '24 15:08 singalsu

I just tested this with MFCC in TGL platform. This patch reduced CPU_PEAK(AVG) from 85.1 to 81.5 MCPS. With LL pipelines the STFT hop/size causes uneven load, so this time using peak as more representative perf measure instead of average.

@singalsu do you have more tests to run or good to go with the optimization after cleanups are resolved ?

I think this is good to merge after suggestions by @lyakh are addressed. Since the matrix operations are not a major part of MFCC these savings are quite significant. So no need for further performance tests.

Great - @ShriramShastry you have a clear path now for merge after @lyakh comments are resolved.

lgirdwood avatar Aug 16 '24 14:08 lgirdwood

Release reminder - one week to v2.11-rc1.

kv2019i avatar Sep 06 '24 10:09 kv2019i

FYI @ShriramShastry pushing to v2.12 (cutoff date today).

kv2019i avatar Sep 13 '24 07:09 kv2019i

Feature cutoff for v2.12, moving this to v2.13.

kv2019i avatar Dec 13 '24 12:12 kv2019i

No update in two releases, moving to TBD milestone.

kv2019i avatar Apr 23 '25 13:04 kv2019i