llvm-project RISCV `-mno-strict-align` generates problem instructions for some targets

When passed --target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align, clang-18 and up generate vle64 and vse64 instructions for the following code despite receiving byte pointers:

__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
  __builtin_memcpy(d, p, 16);
}

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this.

However, -mno-strict-align offers big performance gains for some scalar code, where unaligned access is supported just fine. There seems to be no way to specify that vector code should respect alignments while scalar code can do unaligned access.

Scalar code "seems to work" when you cast a byte pointer to a wider type when it's unaligned, but there are other issues with that and most portable code prefers memcpy() with the expectation that the compiler will optimise it correctly.

Apr 08 '24 18:04 sh1boot

@llvm/issue-subscribers-backend-risc-v

Author: S H (sh1boot)

When passed ` --target=riscv64-linux-gnu -march=rv64gcv -mno-strict-align`, clang-18 and up generate `vle64` and `vse64` instructions for the following code despite receiving byte pointers:

__attribute__((noinline))
void vector_memcpy(uint8_t* d, uint8_t const* p) {
  __builtin_memcpy(d, p, 16);
}

https://godbolt.org/z/eb4oqM3Ts

If the pointer is unaligned then my Kendryte K230 board (C908) will raise a bus error trying to execute this.

However, -mno-strict-align offers big performance gains for some scalar code, where unaligned access is supported just fine. There seems to be no way to specify that vector code should respect alignments while scalar code can do unaligned access.

Scalar code "seems to work" when you cast a byte pointer to a wider type when it's unaligned, but there are other issues with that and most portable code prefers memcpy() with the expectation that the compiler will optimise it correctly.

Apr 08 '24 19:04 llvmbot

Is unaligned scalar really supported in hardware on C908, or is it emulated via a trap and the trap handler doesn't know how to emulate vector?

Apr 08 '24 19:04 topperc

Unaligned scalar loads definitely make my test code run much faster on C908. If it is emulated then it's emulated so efficiently that it's still better than using extra application code to avoid un-aligned loads.

Apr 09 '24 00:04 sh1boot

Is it the case that unaligned scalar is supported while unaligned vector is not supported? https://github.com/llvm/llvm-project/pull/73971 cc @preames I don't remember the details, maybe t-head guys can help to answer this. @zixuan-wu @rjiejie

Apr 09 '24 04:04 wangpc-pp

Well on the k230 SOC, unaligned scalar memory access does work and does not have a significant performance impact. It is much faster than loading bytes and assembling them into a word in application code. Enabling this optimisation has a big impact on performance.

Meanwhile, vle8 works at arbitrary memory offsets, and vle64 raises a bus error if given an odd memory address.

Is it supposed to be configurable in the kernel? I don't know. Somebody is shipping kernels configured this way anyway.

Clang-17 currently generates much faster code than clang-18 and clang-19 because the latter cannot emit unaligned scalar access safely.

Apr 10 '24 15:04 sh1boot

For llvm, we should have a way to differentiate fast unaligned access support for both scalars and vectors. The current -mno-strict-align is confusing and resulting in bug like this. See google/android-riscv64 for the context.

Apr 10 '24 20:04 appujee

Just summarising my understanding of things so people can point out if I have it wrong or disagree:

Zicclsm was defined is a mandatory extension in the RVA20 profile and is explicitly specified as indicating support for misaligned load/store for both scalar and vector memory operations. They might execute slowly of course.
- While the success of RISC-V profiles remains to be seen, I think it's fair to consider that an application that subsets RVA20 is going to be fairly niche going forwards. We should moderate our effort in terms of defining compiler flags etc based on this.
Separately there's the issue of whether those loads/stores are fast or not. There was quite a lot of back and forth on this here https://github.com/riscv-non-isa/riscv-c-api-doc/issues/32 (which also documented the recommendation that the compiler maps -mno-strict-align as meaning unaligned access is fast) that but it looks like we never really considered the answer being different for vector vs scalar.

It would be useful to know if there is other hardware where scalar misaligned access are fast, but vector slow. If it's basically unheard of, that might indicate different solutions in terms of command-line interface vs if it's likely to happen quite a lot. In the latter case, perhaps we got the interface for https://github.com/riscv-non-isa/riscv-c-api-doc/issues/32 wrong.

Apr 11 '24 08:04 asb

Zicclsm was defined is a mandatory extension in the RVA20 profile and is explicitly specified as indicating support for misaligned load/store for both scalar and vector memory operations.

Makes sense to not have compiler flags as the misaligned access support for both scalar and vectors is clearly required for the RVA20 profile. Thanks for pointing to the reference.

Apr 11 '24 17:04 appujee

I believe I tried Zicclsm as part of -march when I was trying to figure out how to enable unaligned codegen (noting problems with uneven handling of -mno-strict-align), and it didn't help; but since I want sure I understood the description I didn't raise a bug.

If that is the same thing then I guess C908 might be considered an RVA20 compliant device if something emulated (or enabled) unaligned vector memory ops by default?

But I think RVA20 isn't telling us if the operation is fast enough to use, right? Just that it shouldn't give a bus error. That's what the command line argument would be for iff C908 became a device that met the required spec or if there were others like it?

Apr 12 '24 03:04 sh1boot

I believe I tried Zicclsm as part of -march when I was trying to figure out how to enable unaligned codegen (noting problems with uneven handling of -mno-strict-align), and it didn't help; but since I want sure I understood the description I didn't raise a bug.

If that is the same thing then I guess C908 might be considered an RVA20 compliant device if something emulated (or enabled) unaligned vector memory ops by default?

But I think RVA20 isn't telling us if the operation is fast enough to use, right? Just that it shouldn't give a bus error. That's what the command line argument would be for iff C908 became a device that met the required spec or if there were others like it?

C908 supports general unaligned access for vectors, but not element unaligned access in vectors, which should comply with the relevant spec requirements: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#8-vector-memory-alignment-constraints Therefore, in our optimizations for memory load/store, we generally handle unaligned logic using elements of length 8, and larger elements are employed when aligned. Referring to arch/riscv/kernel/traps_misaligned.c, the kernel we are currently using does not seem to handle unaligned vectors. https://gitee.com/thead-yocto/kernel/blob/master/arch/riscv/kernel/traps_misaligned.c#L240 Based on my impression that vector misalignment is purely handled by hardware, and its performance should be faster compared to regular instructions.

Apr 15 '24 02:04 MaoHan001

Is it the case that unaligned scalar is supported while unaligned vector is not supported? #73971 cc @preames I don't remember the details, maybe t-head guys can help to answer this. @zixuan-wu @rjiejie

Before #73971 -mno-unaligned-access enabled both unaligned scalar and vector accesses anyway, so that was just an internal-facing change

Apr 15 '24 09:04 lukel97

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858 Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

Apr 16 '24 02:04 MaoHan001

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858 Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

This post says the opposite https://lists.riscv.org/g/tech-unprivileged/topic/ar_minutes_2024_1_9/103776876?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,103776876,previd%3D1709213912495550035,nextid%3D1705361818666677767&previd=1709213912495550035&nextid=1705361818666677767

There was a discussion on whether the current profile text actually currently mandates vector misaligned support, and whether it should be mandated. The conclusion was that the current Zicclsm spec does mandate vector misaligned support, and that this is the correct thing to do. Note to be added to clarify.

Apr 16 '24 03:04 topperc

https://lists.riscv.org/g/sig-vector/topic/questions_about_zicclsm/105531858?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,105531858 Based on the above thread, the description of "misalign" for vector in Zicclsm should be consistent with "8. Vector Memory Alignment Constraints" without making additional requirements for element misalignment.

This post says the opposite https://lists.riscv.org/g/tech-unprivileged/topic/ar_minutes_2024_1_9/103776876?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,103776876,previd%3D1709213912495550035,nextid%3D1705361818666677767&previd=1709213912495550035&nextid=1705361818666677767

There was a discussion on whether the current profile text actually currently mandates vector misaligned support, and whether it should be mandated. The conclusion was that the current Zicclsm spec does mandate vector misaligned support, and that this is the correct thing to do. Note to be added to clarify.

This post does not seem to say an opposite argument; it only mentions the hope to clarify the definition of Zicclsm within the RVA23, as the current description does not appear very accurate. In the RVA23, the vector will be mandated, so it is expected by default to support misalignment for vectors as well. Beside, the profile does not seem to involve whether the misalignment for vectors is regarding the entire vector register or the elements (which logically is the content of the vector spec). Referring to the profile state table at https://docs.google.com/spreadsheets/d/1A40dfm0nnn2-tgKIhdi3UYQ1GBr8iRiV2edFowvgp7E/edit#gid=1157775000, both RVA22 and RVA20 are currently ratified, where the requirement for vectors is optional. Meanwhile, Ziccif is still marked as not ratified. It seems strange to proceed with an implementation based on a description that is not yet clear, rather than ratified specifications.

Apr 16 '24 05:04 MaoHan001

Beside, the profile does not seem to involve whether the misalignment for vectors is regarding the entire vector register or the elements (which logically is the content of the vector spec).

A vector that is element aligned is not considered "misaligned" according to the vector spec. There wouldn't be any need to mention vector in the description of Zicclsm in the profile spec if it wasn't meant to refer to the element being misaligned.

Apr 16 '24 05:04 topperc

Zicclsm is not an arch extension in the full sense of the word. It is still called an extension name, but it really is just putting a name to a requirement that mandates what otherwise is optional behavior in an extension (the base ISA extensions in this case). Extension names like this get ratified in conjunction with ratification of the ISA profile that introduces/defines them. Therefore, as long as the ' Vector Memory Alignment Constraints' definition in vector spec is followed, it is considered to have already followed the Zicclsm extension. That is to say 'If an element accessed by a vector memory instruction is not naturally aligned to the size of the element, either the element is transferred successfully or an address misaligned exception is raised on that element'.

Apr 16 '24 06:04 QJtaibai

Zicclsm is not an arch extension in the full sense of the word. It is still called an extension name, but it really is just putting a name to a requirement that mandates what otherwise is optional behavior in an extension (the base ISA extensions in this case). Extension names like this get ratified in conjunction with ratification of the ISA profile that introduces/defines them. Therefore, as long as the ' Vector Memory Alignment Constraints' definition in vector spec is followed, it is considered to have already followed the Zicclsm extension. That is to say 'If an element accessed by a vector memory instruction is not naturally aligned to the size of the element, either the element is transferred successfully or an address misaligned exception is raised on that element'.

My understanding has been that Zicclsm mandates that the element is transferred successfully in all cases and there is never a misaligned exception.

Apr 16 '24 07:04 topperc

I just wanted to report back from the discussion at the end of last week in the RISC-V LLVM sync-up call.

As has also been highlighted here, we previously did have separate features for misaligned scalar and vector alignment, but merged them to match the frontend option as there weren't any users at the time requiring different behaviour.
There's no objection to reintroducing the split between scalar and vector misaligned access support, though to be usable this would likely require work to define what the frontend command-line options should be and also what changes are needed to the defines in riscv-c-api-doc.

This all seems very doable, but would need someone supporting one of these platforms to push it forwards - unfortunately no-one on the call was supporting such a platform, and while I could claim it's on someone's todo list, realistically there's enough other things on everyone's list it's not likely to happen until someone else comes in with patches. It would also be good to better characterise exactly which target devices have fast scalar but slow vector unaligned accesses.

Apr 16 '24 14:04 asb

Does hwprobe report that unaligned accesses are fast on this CPU?

Apr 18 '24 23:04 topperc

Does hwprobe report that unaligned accesses are fast on this CPU?

@sh1boot can you provide the info?

May 14 '24 15:05 appujee

It says Function not implemented. I'll need to go digging around to see what my options are for a viable repo with a more modern kernel.

May 15 '24 04:05 sh1boot

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

May 17 '24 22:05 sh1boot

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

I write https://github.com/cyyself/hwprobe to have a hwprobe example. The output on k230 is:

[cyy@archlinux hwprobe]$ ./hwprobe
MVENDORID: 5b7
MARCHID: 8000000009140d00
MIMPID: 50000
BASE_BEHAVIOR: IMA_
IMA_EXT_0: FD_C_V_Zba_Zbb_Zbs_Zbc_
CPUPERF_0: MISALIGNED-FAST_
ZICBOZ_BLOCK_SIZE: 64 Bytes
[cyy@archlinux hwprobe]$ dmesg | grep aligned
[    0.163736] cpu0: Ratio of byte access time to unaligned word access is 6.32, unaligned accesses are fast

I think the unaligned access probe through the hwprobe syscall on the newer kernel is OK.

May 18 '24 08:05 cyyself

Does hwprobe report that unaligned accesses are fast on this CPU?

@cyyself, you seem to be actively maintaining kernel patches for k230 on a modern kernel. Can you see what this query returns?

I write https://github.com/cyyself/hwprobe to have a hwprobe example. The output on k230 is:
[cyy@archlinux hwprobe]$ ./hwprobe
MVENDORID: 5b7
MARCHID: 8000000009140d00
MIMPID: 50000
BASE_BEHAVIOR: IMA_
IMA_EXT_0: FD_C_V_Zba_Zbb_Zbs_Zbc_
CPUPERF_0: MISALIGNED-FAST_
ZICBOZ_BLOCK_SIZE: 64 Bytes
[cyy@archlinux hwprobe]$ dmesg | grep aligned
[    0.163736] cpu0: Ratio of byte access time to unaligned word access is 6.32, unaligned accesses are fast
I think the unaligned access probe through the hwprobe syscall on the newer kernel is OK.

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

May 18 '24 17:05 topperc

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

Someone did this: https://lore.kernel.org/linux-riscv/CAJgzZorn5anPH8dVPqvjVWmLKqTi5bkLDR=FH-ZAcdXFnNe8Eg@mail.gmail.com/

May 19 '24 02:05 cyyself

hwprobe documentation does not say whether the unaligned access check is for scalar or scalar+vector. Based on this result, the implementation is only checking scalar. So I think the hwprobe documentation should be updated to say that explicitly if that what it is going to check.

Someone did this: https://lore.kernel.org/linux-riscv/CAJgzZorn5anPH8dVPqvjVWmLKqTi5bkLDR=FH-ZAcdXFnNe8Eg@mail.gmail.com/

Thanks I guess whatever I found while searching was old.

May 19 '24 02:05 topperc

How does vectorizer deal with this issue? cc: @alexey-bataev @nikolaypanchenko

May 22 '24 16:05 appujee

How does vectorizer deal with this issue? cc: @alexey-bataev @nikolaypanchenko

It relies on TTI to tell if misaligned vector load/store is legal or not. If misaligned access can be done via set of instructions, it's expected TTI will return cost of this sequence. For instance, for unmasked misaligned vle64 can be emulated via vle8 with 8*VL.

However, I don't think we do have anything to notify compiler that hw supports misaligned access.

May 22 '24 16:05 nikolaypanchenko

However, I don't think we do have anything to notify compiler that hw supports misaligned access.

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

May 22 '24 16:05 topperc

-mno-strict-align should tell TTI that misaligned accesses are fine. I'll double check.

Shouldn't it be driven by extension ? Otherwise compiled binary won't be portable

May 22 '24 17:05 nikolaypanchenko

llvm-project llvm-project copied to clipboard

RISCV `-mno-strict-align` generates problem instructions for some targets

llvm-project
llvm-project copied to clipboard