Optimized the performance of float operations

Hello, Thank you for taking the time to review my pull request. Below is a brief overview of the changes and enhancements I've made. Please let me know if there are any questions or further clarifications needed.

Two PRs will be submitted in total; this is the first one.

PR1

This PR mainly focuses on optimizations for float.

In the initial tests, a peculiar result was observed: when running the matrix tests on Android, the float version took longer to execute than the double version, which is counterintuitive. An analysis was therefore carried out. The first step was to compare the instruction counts of the two test programs: the float test executed 712,287,424 instructions while the double test executed 664,675,474. RTM implements its double math in plain scalar C code, yet the instruction count for the NEON float test was still higher, which prompted further analysis. Disassembling the double test code showed that the compiler had, after optimization, emitted a large number of NEON instructions, significantly accelerating performance. The reasons this optimization kicks in include:

  1. Using 16-byte alignment
  2. Extensive use of RTM_FORCE_INLINE for forced inline expansion

As a result, double performed much better than expected, but this only indicates that the compiler optimizes the double code more aggressively, not that double is inherently faster than float. There must be places in the float implementation that cost more than they should, hence the following two optimizations:

1. Changing matrix parameter passing from value to reference

From the disassembled double code, it can be seen that the compiler eventually inlines the functions, expanding most of the code into a single function body. This disrupts the expected call-stack layout and renders RTM's designed argument-passing strategy ineffective. Most matrix parameters use the by-value convention (the arg0/arg1 typedefs below), which inadvertently introduces many copy operations. Under the Android ARM64 architecture, the types are defined as follows:

using matrix3x3f_arg0 = const matrix3x3f;
using matrix3x3f_arg1 = const matrix3x3f;
using matrix3x3f_argn = const matrix3x3f&;

using matrix3x3d_arg0 = const matrix3x3d;
using matrix3x3d_arg1 = const matrix3x3d&;
using matrix3x3d_argn = const matrix3x3d&;

using matrix3x4f_arg0 = const matrix3x4f;
using matrix3x4f_arg1 = const matrix3x4f;
using matrix3x4f_argn = const matrix3x4f&;

using matrix3x4d_arg0 = const matrix3x4d;
using matrix3x4d_arg1 = const matrix3x4d&;
using matrix3x4d_argn = const matrix3x4d&;

using matrix4x4f_arg0 = const matrix4x4f;
using matrix4x4f_arg1 = const matrix4x4f;
using matrix4x4f_argn = const matrix4x4f&;

using matrix4x4d_arg0 = const matrix4x4d;
using matrix4x4d_arg1 = const matrix4x4d&;
using matrix4x4d_argn = const matrix4x4d&;

Float and double differ in which matrix arguments are passed by reference, which is one of the reasons float is slower here. After changing the matrix-type parameters to be passed by reference, the float tests showed a significant speed improvement.
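For illustration, the direction of the change can be sketched as follows (hypothetical alias definitions, not necessarily the exact ones submitted in this PR): make the float matrix aliases match the double ones so that matrix arguments after the first are taken by const reference on ARM64.

using matrix3x3f_arg0 = const matrix3x3f;   // first argument may still travel in registers
using matrix3x3f_arg1 = const matrix3x3f&;  // subsequent matrix arguments by const reference
using matrix3x3f_argn = const matrix3x3f&;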

2. Modification of the vector_mix function

Compared to a conventional shuffle(), RTM's vector_mix() is special: it allows selecting any element position from either of the two input vectors, whereas a conventional shuffle() usually takes its first two elements from the first vector and its last two from the second. This makes RTM's vector_mix() hard to implement with simple instructions. However, some optimizations based on compile-time information are still possible. The float version of vector_mix() can use __builtin_shufflevector() when compiled with clang, achieving maximum performance. For other compilers, we fall back on compile-time dispatch for acceleration.

template<mix4 comp0, mix4 comp1, mix4 comp2, mix4 comp3>
vector4f RTM_SIMD_CALL vector_mix(vector4f_arg0 input0, vector4f_arg1 input1) RTM_NO_EXCEPT
{
    constexpr int index0 = (int)comp0;
    constexpr int index1 = (int)comp1;
    constexpr int index2 = (int)comp2;
    constexpr int index3 = (int)comp3;
#if defined(__clang__)
    return __builtin_shufflevector(input0, input1, index0, index1, index2, index3);
#else
    if constexpr (index0 < 4 && index1 < 4 && index2 >= 4 && index3 >= 4) {
        return vector_shuffle(input0, input1, index0, index1, index2 - 4, index3 - 4);
    }
    else if constexpr (index0 < 4 && index1 < 4 && index2 < 4 && index3 < 4) {
        // input1 is not used here
        return vector_swizzle(input0, index0, index1, index2, index3);
    }
    else if constexpr (index0 >= 4 && index1 >= 4 && index2 >= 4 && index3 >= 4) {
        // input0 is not used here
        return vector_swizzle(input1, index0 - 4, index1 - 4, index2 - 4, index3 - 4);
    }
    else {
        float combine_arr[8];
        vector_store(input0, combine_arr);
        vector_store(input1, combine_arr + 4);
        return vector_set(combine_arr[index0], combine_arr[index1], combine_arr[index2], combine_arr[index3]);
    }
#endif
}
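For context, a call site looks something like this (a minimal sketch; it assumes RTM's mix4 enumerators, where x/y/z/w select lanes of the first input and a/b/c/d select lanes of the second):

// Blend the low two lanes of each input, producing [x0, y0, x1, y1].
vector4f blend_low(vector4f_arg0 input0, vector4f_arg1 input1) RTM_NO_EXCEPT
{
    return vector_mix<mix4::x, mix4::y, mix4::a, mix4::b>(input0, input1);
}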

daliziql avatar Jun 21 '24 09:06 daliziql

Hello and thank you for the contribution! I apologize for the late reply, I am just coming back from a trip abroad.

These changes to argument passing are quite subtle and sensitive. I'll have to double check things on my end and compare the generated assembly, etc. As a result, it will take me some time to review things. I anticipate that I'll have time to look into this in early July. I'll get back to you then.

In the meantime, I just wanted to give some general context. Your analysis makes a lot of sense, but there's a few things at play that are worth considering.

Float32 arithmetic on ARM uses NEON SIMD registers. This allows us to pass vector/quat/mask values by value in registers and return them by register as well. For aggregate types (e.g. qvv, matrix), things are a bit more complicated. With clang, a few aggregate types (depending on size/internals) can be passed by value in registers, BUT aggregate values are not returned by register (unlike with __vectorcall on MSVC). When functions inline, this distinction doesn't really matter, but when they don't, it comes into play as it forces round trips to stack memory (also called a load-hit-store). Typically, modern processors handle this case quite well through store-forwarding, but a few extra cycles on the load are still incurred. As a result, code that uses float32 ends up being quite dense, with many instructions dependent on one another, which can introduce bubbles in the execution, and the extra latency from store-forwarding makes things worse.

In contrast, float64 uses scalar math on ARM (for the time being; it is on my roadmap to use SIMD registers for XY and ZW in pairs like we do with SSE). Using scalar math makes the generated assembly much larger as many more instructions are required. This has an adverse effect on inlining, as large functions don't inline as well. However, despite the larger number of instructions, most of them can execute independently since SIMD lanes are often independent. This means that with float64 there are far fewer bubbles in the execution stream and far more work to execute. As a result, modern out-of-order CPUs can be kept well fed with few to no stalls in execution. And so, even if each instruction is more expensive, the gap in execution cost between float32 and float64 might not be as large as one might expect in practice. Note that using XY and ZW in pairs will help reduce the assembly size, improving inlining and performance, but because both pairs are often independent, the rest of the analysis remains consistent.
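As a rough sketch of that roadmap idea (not RTM's current implementation; the names below are purely illustrative), a float64 vector could be held as two NEON double-pairs so each operation maps to two independent instructions:

#include <arm_neon.h>  // AArch64 NEON

// Illustrative only: a 4-wide double held as XY and ZW pairs.
struct vector4d_pairs
{
    float64x2_t xy;
    float64x2_t zw;
};

inline vector4d_pairs vector_add_pairs(const vector4d_pairs& lhs, const vector4d_pairs& rhs)
{
    // The two halves carry no data dependency and can execute back to back.
    return vector4d_pairs{ vaddq_f64(lhs.xy, rhs.xy), vaddq_f64(lhs.zw, rhs.zw) };
}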

In the end, whether a function inlines or not is often the biggest performance factor at play, and matrix math often uses many registers and many instructions, hindering inlining. Crucially, whether a function inlines is also determined by where it is called, so the measurements depend heavily on the sort of code that you have. Are you at liberty to share what the calling code looks like and which RTM functions are involved in your measurements, or did you take broad measurements over a large and complex piece of code?

Cheers, Nicholas

nfrechette avatar Jun 23 '24 15:06 nfrechette

Hi Nicholas, Thank you for your reply. The main content of the test involves matrix composition, transformation, and inversion operations. Below is the general framework of the test code:

//----------------------------------------------------------------------------------------
// matrix compose and transform
//----------------------------------------------------------------------------------------
template<typename FloatType, typename CalcPolicy>
static void DoMatrixComposeImpl(benchmark::State& state) {
    using Vector4Array = std::vector<TSimdVector<FloatType>>;
    using QuaternionArray = std::vector<TQuaternion<FloatType>>;
    
    Vector4Array    translationArray;
    Vector4Array    scaleArray;
    QuaternionArray quatArray;
    Vector4Array    originalArray;
    Vector4Array    resultArray;

    ...

    for (int i = 0; i < kMathCalcCount; i++) {
       translationArray[i] = TSimdVector<FloatType>(MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), 1.0f);
       scaleArray[i] = TSimdVector<FloatType>(1.0f, 1.0f, 1.0f, 1.0f);
       quatArray[i] = TQuaternion<FloatType>::fromAxisAngle(TVector3<FloatType>::YAxisVector, ScalarTool::degreesToRadians(MathTool::rangeRandom(0.0, 90.0)));
       originalArray[i] = TSimdVector<FloatType>(0.0, 0.0, 0.0, 1.0);
    }

    for (auto&& _ : state) {
       for (int i = 0; i < kMathCalcCount; i++) {
           TMatrix4<FloatType> tMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(translationArray[i], TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> sMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), scaleArray[i], TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> rMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), quatArray[i]);

           TMatrix4<FloatType> matrix = tMat.template _simd_multiOther<CalcPolicy>(rMat).template _simd_multiOther<CalcPolicy>(sMat);

           resultArray[i] = matrix.template _simd_transformVector4<CalcPolicy>(originalArray[i]);
       }
    }
}

The _simd_xxxx functions in the TMatrix4 class implement all matrix operations internally within our project. Some of the lower-level functions were initially implemented as follows:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline matrix3x3f RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3f_arg0 lhs, matrix3x3f_arg1 rhs) RTM_NO_EXCEPT {
    matrix3x3f out_m{};

    vector4f tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    out_m.z_axis = tmp;

    return out_m;
}
auto r = simd::matrix_mul_fill_mode(rhs, lhs);

Here we run into the parameter-copy and return-value issues you mentioned earlier. We have since changed such function calls:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline void RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3d_argn lhs, matrix3x3d_argn rhs, matrix3x3d &out_m) RTM_NO_EXCEPT {
    vector4d tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    out_m.z_axis = tmp;
}

TMatrix4<T> r{};
simd::matrix_mul_fill_mode(rhs.simdRef(), lhs.simdRef(), r.simdRef());

The main change was to modify the parameter passing of the matrix to be by reference. The performance after these modifications has already shown significant improvement. Additionally, I would like to mention that the performance issues discussed here are occurring on ARM64 Android devices. The performance on Windows and Mac aligns with your expectations.

daliziql avatar Jun 24 '24 07:06 daliziql

One more thing: our project targets a fairly recent C++ standard, so this PR does not yet handle C++11 compatibility well (for example, the if constexpr usage above). I need to make some adjustments.

daliziql avatar Jun 24 '24 07:06 daliziql

Thank you for the clarification. I will see if I can add a benchmark test based on your sample and see if I can reproduce locally.

What kind of processors/android device are you seeing this on?

I'll take a look at this in the next 2 weeks.

nfrechette avatar Jun 30 '24 15:06 nfrechette

The processor is a Snapdragon XR2 Gen 2.

daliziql avatar Jul 01 '24 02:07 daliziql

I encountered an issue with unit tests. The configurations build pull request / vs2022 (vs2022-clang, release, x64, -simd) and build pull request / vs2022 (vs2022-clang, release, x64, -avx) are indicating that some unit tests are failing. However, when I compile locally with the same CMake options, all tests pass. Do you have any additional information you can provide?

daliziql avatar Jul 01 '24 03:07 daliziql

Yes, those failures are probably due to a known compiler/toolchain issue, see this PR for details: https://github.com/nfrechette/rtm/pull/212

I wouldn't worry about it for now. I'm waiting for github to update the image with a newer VS version that has a fixed clang version. Sadly, for reasons unknown, RTM ends up triggering a LOT of compiler bugs in various toolchains. Over the years, I've found dozens of bugs (and reported many) in msvc, gcc, and clang. Thankfully, it has gotten better over the years.

nfrechette avatar Jul 04 '24 15:07 nfrechette

I added a benchmark to profile argument passing for matrix3x3f here: https://github.com/nfrechette/rtm/pull/219 On my M1 laptop, passing by value is a clear winner and the generated assembly by apple clang makes sense to me. The details are in the cpp files.

The results are as follows for me:

bm_matrix3x3_arg_passing_ref         24.0 ns         24.0 ns     28645439
bm_matrix3x3_arg_passing_value       16.1 ns         16.1 ns     43122832

This is in line with my expectations and confirms why I chose to pass as many aggregates by register as possible:

  • Passing by value means functions use fewer instructions: no loads/stores are needed, only perhaps a mov. This improves their chances of getting inlined, speeding things up even further.
  • It leaves the job of loading/storing to the caller, which can better schedule those instructions to hide some of their latency (assuming we hit the L1 cache).
  • Intermediate aggregates can sometimes be held entirely in registers (e.g. matrix_mul(a, matrix_mul(b, c)); see the sketch after this list).
  • Because we don't know how the input arguments were generated, loading them from memory may hit store-forwarding, which is reasonably fast but not free. By passing by value, the caller pays the price of the load and any potential stalls, which is where it shows up when profiling.
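As a small sketch of that third point (illustrative code only: it assumes matrix_mul is available for matrix3x3f as in the benchmark above, reuses the argument typedefs shown earlier, and concat3 is a made-up helper):

// With register passing and register returns, the intermediate product of
// b and c never needs to be spilled to the stack.
inline matrix3x3f RTM_SIMD_CALL concat3(matrix3x3f_arg0 a, matrix3x3f_arg1 b, matrix3x3f_argn c) RTM_NO_EXCEPT
{
    return matrix_mul(a, matrix_mul(b, c));
}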

I will see with my Pixel 7 android phone if I can replicate the results when I get the chance this week. I suspect that the results will be consistent.

It may be worthwhile digging further into your benchmark and how you measured the difference. Did you only measure in a micro-benchmark, or did you also observe an improvement in a non-synthetic use case (e.g. an actual application)? Micro-benchmarks offer only a narrow view and can fail to capture the actual cost/benefit of one approach versus another. How did the assembly look before and after the change?

It may also be worthwhile trying to run my benchmark on your device to see if you can reproduce my findings. From there, perhaps you may be able to tweak it to showcase the results you've seen on your end.

nfrechette avatar Jul 04 '24 20:07 nfrechette

The CI also ran my benchmark on x64 SSE2 with clang 14 and we can see there that the calling convention not returning aggregates by register indeed causes performance issues:

bm_matrix3x3_arg_passing_ref                27.8 ns         27.8 ns     25140949
bm_matrix3x3_arg_passing_value              51.8 ns         51.8 ns     13516762

I'll have to see what the generated assembly looks like there. Later this/next week I'll give that a try on my Zen2 desktop.

nfrechette avatar Jul 04 '24 20:07 nfrechette

Our test results were also based on benchmark reports under Android, but the code in our benchmark is slightly more complex. The conclusion of the comparison was initially surprising: SIMD performance for double was actually better than for float, which was counterintuitive. We eventually found a difference in parameter passing between the two, so we modified the parameter passing, and float performance indeed saw a significant boost. As for the code in this PR, there are many non-standard parts, and I will fix them one by one. Next week I will also run your benchmark code on my device to see how it differs from mine.

daliziql avatar Jul 06 '24 07:07 daliziql

Here are some more notes profiling argument passing on my Zen2 desktop.

With VS2022 SSE2 and __vectorcall, the results are as follows:

bm_matrix3x3_arg_passing_current       18.8 ns         18.4 ns     37333333
bm_matrix3x3_arg_passing_ref           21.9 ns         21.5 ns     32000000
bm_matrix3x3_arg_passing_value         14.5 ns         14.4 ns     49777778

This is because I originally opted not to pass the second argument by value. This may appear sub-optimal in this synthetic benchmark, but in practice it depends a lot on the function signature. __vectorcall assigns registers in slots, where a slot can be used by an int/float/vec. Even though an int might not use an XMM register, MSVC ignores this and that XMM register slot will not be assigned. As such, RTM allows some slack for aggregate types and won't use all registers. I'm not fully sold on this; it needs to be measured in a more complex benchmark that isn't synthetic. Here again, passing by value beats passing by reference.

With VS2022 SSE2 without __vectorcall, the results are as follows:

bm_matrix3x3_arg_passing_current       30.1 ns         28.6 ns     22400000
bm_matrix3x3_arg_passing_ref           21.8 ns         22.2 ns     34461538
bm_matrix3x3_arg_passing_value         33.1 ns         33.0 ns     21333333

Here, surprisingly, we can see that passing by value is slower than by reference. It is slower because with the default calling convention, vectors passed by value are written to the stack and thus passed by reference under the hood. Current is also slower: it ends up returning the matrix by value on the stack while arguments are passed by reference, and the result must be copied into the actual variable upon return. This is why it is slower than by reference, where the return address is provided as an argument. __vectorcall is a clear winner here as it avoids a lot of extra work.

With VS2022 SSE2 and Clang 17, the results are as follows:

bm_matrix3x3_arg_passing_current       30.0 ns         30.1 ns     24888889
bm_matrix3x3_arg_passing_ref           22.1 ns         21.5 ns     32000000
bm_matrix3x3_arg_passing_value         31.1 ns         31.8 ns     23578947

The numbers here are slightly different but consistent with the SSE2 non-vectorcall ones. The assembly is slightly different but the end result is the same for all 3.

With my Pixel 7, the results are as follows:

bm_matrix3x3_arg_passing_current       18.5 ns         18.5 ns     37515808
bm_matrix3x3_arg_passing_ref           25.9 ns         25.8 ns     27024174
bm_matrix3x3_arg_passing_value         18.5 ns         18.4 ns     37885260

Here as well, the numbers are consistent with my M1 laptop: passing and returning by value is faster than by reference.

Overall, it's tricky. What is optimal for NEON and vectorcall isn't optimal elsewhere.

nfrechette avatar Jul 07 '24 03:07 nfrechette

Thank you very much for sharing the data. It seems that keeping the original method of passing parameters by value better meets the requirements of the rtm library. I have also extracted the business-related content from my local project and run benchmarks specifically comparing passing parameters by value versus by reference. The results show that the performance of both methods is almost identical, and passing by reference holds no significant advantage over passing by value. I apologize for the premature and incorrect conclusion I made earlier. Once again, thank you for your professional response, which has been very beneficial to me. I will spend some more time analyzing the actual cause of the issue in my project.

daliziql avatar Jul 08 '24 09:07 daliziql

Thank you for taking the time to dig deeper :)

Writing synthetic benchmarks is as much art as it is science. It is not trivial, especially for simple low-level functions with few instructions like these. It is very easy to end up measuring side effects that you did not intend to measure or account for. I've made many mistakes in the past when writing them and, in the end, sometimes it isn't possible to capture the true impact that would be seen in real-world usage. I've seen many cases where a synthetic benchmark shows a win for one approach over another which turns out to be the opposite in real code due to inlining and scheduling (for such small low-level things). As an example, I spent at least 3-6 months figuring out how to properly benchmark animation decompression: https://github.com/nfrechette/acl/blob/ac1ea98938eef4d4bd4c9742a059cb886cad19d5/tools/acl_decompressor/sources/benchmark.cpp#L50

In the end, sometimes it isn't possible to write a function that is optimal on every architecture or in every usage scenario. RTM aims to provide sane defaults where possible, but it is expected that if you need specialized versions (due to the unique circumstances of your code) you'll write them outside RTM. For example, sometimes you need a matrix function inlined in a very hot, tight loop even though in general you might not want to always inline it due to code-size bloat. Another example is my animation compression library, where I need stable versions of the quaternion functions that won't change as I update RTM, to ensure determinism over time. That being said, if you think something is more widely useful and should belong within RTM, feel free to submit a PR as you did and we can discuss and consider it :)

nfrechette avatar Jul 09 '24 01:07 nfrechette

Out of curiosity, I also added the same benchmark for matrix3x3d to see.

bm_matrix3x3d_arg_passing_current             34.6 ns         34.6 ns     20240340
bm_matrix3x3d_arg_passing_ref                 25.6 ns         25.5 ns     27587077
bm_matrix3x3d_arg_passing_value               34.3 ns         34.3 ns     20299917
bm_matrix3x3f_arg_passing_current             16.2 ns         16.2 ns     43312275
bm_matrix3x3f_arg_passing_ref                 24.0 ns         24.0 ns     29134134
bm_matrix3x3f_arg_passing_value               16.2 ns         16.2 ns     43387982

Doubles are slower but, as you've found, when passed by reference they are almost as fast as the float version even though they use scalar arithmetic instead of SIMD pairs. With SIMD pairs, perhaps double could get faster under this synthetic benchmark. However, passing by value (currently the default for the first matrix3x3d argument) is quite a bit slower. It appears that with doubles, the aggregate structures are neither passed by register as arguments nor returned by register as a return value. This means that we have to round-trip to the stack every time :( I'll add a note to double-check this against the NEON documentation, as it appears this could be improved. That might also change down the road once I optimize doubles to use SIMD pairs. Thanks to your input, we now have benchmarks to track this :)

nfrechette avatar Jul 09 '24 01:07 nfrechette

Hello, This PR has been open for a while. The last time around, I mentioned a number of changes required for me to merge this into the main branch. If you don't get the chance to do so, I'll take some inspiration from your suggestions and go ahead to make some of these myself in a separate PR. I'll then subsequently close this one.

Let me know if you'd like to pick up where you left off or if I can go ahead with my own changes.

Cheers, Nicholas

nfrechette avatar Feb 13 '25 02:02 nfrechette

Hi Nicholas, Thank you for reaching out. At the moment, I am unable to make the necessary changes. Please feel free to proceed with your own modifications in a separate PR, as you suggested. I appreciate your understanding and support.

daliziql avatar Feb 13 '25 04:02 daliziql

I've incorporated various vector_mix optimizations in #232 including the usage of the builtin vector shuffle for GCC and Clang.

I played with the idea of using template specialization and ended up electing not to use it for two reasons:

  • I wanted to avoid introducing a nested function call; it stresses inlining, and it is something I generally try to avoid for such simple functions that should ideally always inline. Nested inlining doesn't play well across all toolchains.
  • Template specialization was also significantly slower to compile in my local tests. It seems that adding new function definitions is much slower than the constexpr evaluations. Because this is a core header that is likely included in TONS of cpp files, a compile-time regression of that magnitude was not desirable even if the code was cleaner. (A toy comparison of the two styles is sketched after this list.)
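As a toy illustration of those two styles (generic code for the sake of the example, not RTM's actual implementation), the same lane selection can be dispatched either with constexpr branching inside one function or through an extra specialized template:

// Style 1: constexpr branching in a single function (the approach kept for vector_mix).
template<int index>
float select_lane_constexpr(const float (&v0)[4], const float (&v1)[4])
{
    if constexpr (index < 4)
        return v0[index];
    else
        return v1[index - 4];
}

// Style 2: template specialization. It behaves the same, but it introduces extra
// function definitions (slower to compile per the note above) and a nested call
// that must inline away.
template<int index, bool use_second = (index >= 4)>
struct lane_selector
{
    static float get(const float (&v0)[4], const float (&)[4]) { return v0[index]; }
};

template<int index>
struct lane_selector<index, true>
{
    static float get(const float (&)[4], const float (&v1)[4]) { return v1[index - 4]; }
};

template<int index>
float select_lane_specialized(const float (&v0)[4], const float (&v1)[4])
{
    return lane_selector<index>::get(v0, v1);
}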

I also fully implemented all permutations for vector_mix for SSE2/SSE4 both for vector4f and vector4d. This should yield optimal codegen. AVX was not used as the perm instruction is still significantly slower on some mainstream AMD chips which makes its use precarious in the general case. I also improved the permutations for NEON/NEON64 but there remains some room for improvement there.

Cheers, Nicholas

nfrechette avatar Mar 09 '25 15:03 nfrechette