CMSIS-DSP icon indicating copy to clipboard operation
CMSIS-DSP copied to clipboard

arm_float_to_q31: why not to use VCVT.F32.S32 on Cortex-M4F

Open AlanCui4080 opened this issue 1 year ago • 4 comments

Hi,

As what we know, Cortex-M4F implemented a little set of FPU instructions including VCVT.F32.S32, and GCC did have a builtin intrinsic for it. But the question is why the intrinsic is only enabled when NEON available in GCC, and also, why not to use it in arm_float_to_q31.

Alan.

AlanCui4080 avatar Oct 21 '24 06:10 AlanCui4080

@AlanCui4080 I don't see any reason. Perhaps the function was first developped for M0 and was not upgraded to support all other architectures.

christophe0606 avatar Oct 21 '24 08:10 christophe0606

@christophe0606

Sorry for my mistake, it's ok to include arm_neon.h on M4F, even there is only a subset of NEON implemented on M4F, but the vcvt_s32_f32 will call a SIMD instruction "FCVTZS Vd.2S,Vn.2S" which is invalid on M4F. The only vaild one is "VCVT.F32.S32 Sd,Sm #fbits" included in FPv4-SP. I'm testing it on my STM32G4, if that be ok, i will put a pull request.

AlanCui4080 avatar Oct 21 '24 08:10 AlanCui4080

@AlanCui4080 I won't include arm_neon.h for M4F.

There are thus two possibilities : this intrinsics is supported by the Arm C language extensions (ACLE). Unfortunately, there are still too many compilers that are not fully implementing all of the ACLE.

That's why most of the intrinsics used by CMSIS-DSP are coming from CMSIS-Core (part of CMSIS-6). And probably, you'll need to open an issue on CMSIS-Core if you want this new intrinsic to be supported

christophe0606 avatar Oct 21 '24 09:10 christophe0606

@christophe0606 I figure out it, ACLE have no single precision version (FPv4-SP) for vcvt_s32_f32, it's either d-register or q-register in ACLE. So I do believe someone forget to add it into ACLE, because this instruction is unpopular and not wide used.

And the CMSIS-Core seems to be not include any part of FPU instructions, where should i put it in.

Note: following inline asm is proved usable as a replacement of arm_float_to_q31.

asm("vcvt.s32.f32 %0, %0, #31": "+t"(thetaf_divpi.f)::);

AlanCui4080 avatar Oct 21 '24 09:10 AlanCui4080

@christophe0606 Hi, Is that ok to add a inline assembly specilized for M4? If so, i will open a pull request.

AlanCui4080 avatar Oct 31 '24 09:10 AlanCui4080

@AlanCui4080 asm definitions are provided through CMSIS-Core by defining new intrinsics macros in the compiler headers

So you can try to open a github issue on the CMSIS repository.

christophe0606 avatar Oct 31 '24 09:10 christophe0606

@christophe0606, @AlanCui4080, where does this new instrinsic belong to? Is this one the correct location? https://github.com/ARM-software/CMSIS_6/blob/8c4dc58928b3347f6aa98b6fb2bf6770f32a72b7/CMSIS/Core/Include/cmsis_gcc.h#L877-L992

Please feel free to raise a PR adding the required define. Please bare in mind that this is location not specific to Cortex-M4F. Nor is is specialized for Arm-v7EM but is pulled for A- and R-class as well. If this is strictly for Arm-v7EM, it would need to go into https://github.com/ARM-software/CMSIS_6/blob/main/CMSIS/Core/Include/m-profile/cmsis_gcc_m.h et al.

JonatanAntoni avatar Nov 04 '24 11:11 JonatanAntoni

@JonatanAntoni @AlanCui4080 It must be added to each compiler and would be for Cortex-M only. (Since the intrinsics is defined with Neon header).

I don't have compiler specific implementation (except in case of major compiler bug) in CMSIS-DSP so this new intrinsics should be made available for all compiler so that CMSIS-DSP can still be built with all of them.

Otherwise, it is now possible to define functions as WEAK for linker in CMSIS-DSP. So, another possibility is for a user to replace a specific function with a different implementation if for some reason we can't do it on our side.

christophe0606 avatar Nov 04 '24 11:11 christophe0606

@christophe0606, @AlanCui4080, where does this new instrinsic belong to? Is this one the correct location? https://github.com/ARM-software/CMSIS_6/blob/8c4dc58928b3347f6aa98b6fb2bf6770f32a72b7/CMSIS/Core/Include/cmsis_gcc.h#L877-L992

@JonatanAntoni No, that is not a DSP instruction, it's belong to FPU. I have never seen intrinsics for FPUs in CMSIS-Core. I think that's because CMSIS-DSP is designed for fixed-point at beginning, and FPU instructions are duty of compliers.

Meanwhile, there is no complier include intrinsics for FPU instuctions (But have MVE&NEON) to itself. Well, so i think people all focused on newer architecture such one have MVE. So armv7 old FPU just be forgot like 8087 insturctions.

See https://developer.arm.com/documentation/ddi0439/b/BEHJADED, since there is no absolute value statement, square root statement, or q31 to float statement in C language, compliers will never generate such instructions to accelerate program running. For newlib, they wrote a inline asm for sqrtf() see https://github.com/bminor/newlib/blob/master/newlib/libm/machine/arm/e_sqrt.c, this fact also pointed out that "there is no complier include intrinsics for FPU instuctions". By the way, i didn't see they did the same thing for "FABS.F32" or "FCVT.XX.XX", may because just no one cared.

These FPU instructions are orphans, i have no idea about where to add them.

AlanCui4080 avatar Nov 04 '24 19:11 AlanCui4080

There is no need for an intrinsic for anything that can be expressed in standard C - as long as it's reasonably straightforward to write and for the compiler to generate the corresponding instruction.

A compiler can generate "FABS.F32" from C source using fabs(), and it can generate the fixed-point FCVT by using an expression like (int) (float_val * 0x1p31f). Certainly at -Ofast.

I've checked and GCC will do both of those. Clang fails to generate the FCVT, but it would make more sense to improve its optimisation so it does than to specify and implement an intrinsic.

I'd agree that there are some scalar instructions falling into gaps between the compiler, ACLE and vector headers, but this isn't one of them.

At more conformant settings there might be problems like you have in heavier functions to do with errno handling. (See my https://github.com/ARM-software/CMSIS-DSP/issues/72 ). But again, both compilers can generate FSQRT.F32 with the right settings.

So this assertion is totally untrue:

since there is no absolute value statement, square root statement, or q31 to float statement in C language, compliers will never generate such instructions to accelerate program running.

Compiler explorer demo:

https://godbolt.org/z/WMxoGMMx1

kjbracey avatar Jan 27 '25 10:01 kjbracey

So this assertion is totally untrue:

since there is no absolute value statement, square root statement, or q31 to float statement in C language, compliers will never generate such instructions to accelerate program running.

Compiler explorer demo:

https://godbolt.org/z/WMxoGMMx1

Sorry for my mistake.

But, this question is for arm dsp library, as you can see, this library manually control behavior for every cortex models, and use hardware intrinsic if it exists. So, i think its not good enough to depend a performance critical library on complier auto infering.

I'd agree that there are some scalar instructions falling into gaps between the compiler, ACLE and vector headers, but this isn't one of them.

It does have no intrinsic, meanwhile, absolute value statement having __fabs() (https://developer.arm.com/documentation/dui0375/g/Compiler-specific-Features/--fabs-intrinsic), square root statement having __sqrt() (https://developer.arm.com/documentation/dui0376/c/Compiler-specific-Features/Instruction-intrinsics/--sqrt-intrinsic), so i will keep on my opion, it did lost in gaps.

Could I close this issue by adding exceptions for m4f with inline assembly?

AlanCui4080 avatar Jul 22 '25 05:07 AlanCui4080

@AlanCui4080

So, i think its not good enough to depend a performance critical library on complier auto infering.

CMSIS-DSP is a trade-off between portability and performance. As consequence, and unfortunately, it is dependent on what compilers can do.

It is thus not surprising that you can often get better performance by writing an asm kernel.

If you can't get the new intrinsics added to CMSIS Core then the other solution is to replace the function by a custom one using weak linking.

I won't add any asm kernel or asm intrinsics to CMSIS-DSP directly. CMSIS-DSP is only relying on intrinsics either provided by the compiler or a library like CMSIS Core.

christophe0606 avatar Jul 22 '25 05:07 christophe0606

@christophe0606

As consequence, and unfortunately, it is dependent on what compilers can do. But I see many manually vectorizing in code.

Anyway, as it can be auto optimized by complier , i will close this issue, thank you for your help.

AlanCui4080 avatar Jul 22 '25 08:07 AlanCui4080