
Machine code for aarch64 `vcombine_` intrinsics may be suboptimal

Open gnzlbg opened this issue 6 years ago • 5 comments

Clang implements the vcombine_ intrinsics using shufflevector (https://github.com/llvm-mirror/clang/blob/master/test/CodeGen/aarch64-neon-vcombine.c). I've done the same (https://github.com/gnzlbg/stdsimd/commit/b2fdeda18b1fb4c8b7c8706f48e0d2637dc4966b#diff-2e4ef22de80cb67140d6b5ea99acea70R627) but instead of getting this (https://godbolt.org/g/TVw4Mq) or this (https://godbolt.org/g/xJTMHe):

vcombine_f64(__Float64x1_t, __Float64x1_t): // @test_vcombine_f64(__Float64x1_t, __Float64x1_t)
  mov v0.d[1], v1.d[0]
  ret

I'm getting something like this:

disassembly for coresimd::coresimd::aarch64::neon::assert_vcombine_f32_dup::vcombine_f32_shim: 
	 0: adrp x8, e4000 <byte_str.j.llvm.1524587332910266792+0x2d0> 
	 1: ldr x8, [x8, #4008] 
	 2: adrp x9, 99000 <byte_str.L.llvm.6299433742659787578+0x23> 
	 3: add x9, x9, #0x84b 
	 4: mov w10, #0x28 // #40 
	 5: mov v0.d[1], v1.d[0] 
	 6: stp x9, x10, [x8] 
	 7: ret 
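For readers unfamiliar with the approach, here is a minimal scalar sketch of what "implementing vcombine with a shuffle" means. It is plain Rust on arrays, not the actual stdsimd code: `shuffle2` and `vcombine_f32_model` are hypothetical names, and the index convention (indices below N pick from the first input, the rest from the second) follows LLVM's `shufflevector`.

```rust
/// Scalar model of a two-input shuffle (hypothetical helper, not a real
/// stdsimd API). Indices 0..N select from `a`, indices N..2N select from `b`.
fn shuffle2<const N: usize, const M: usize>(
    a: [f32; N],
    b: [f32; N],
    idx: [usize; M],
) -> [f32; M] {
    let mut out = [0.0f32; M];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        *o = if i < N { a[i] } else { b[i - N] };
    }
    out
}

/// Hypothetical stand-in for `vcombine_f32`: concatenating two 2-lane
/// vectors is the shuffle with the identity index list [0, 1, 2, 3].
fn vcombine_f32_model(low: [f32; 2], high: [f32; 2]) -> [f32; 4] {
    shuffle2(low, high, [0, 1, 2, 3])
}
```

Calling `vcombine_f32_model([1.0, 2.0], [3.0, 4.0])` yields the four lanes in order. In the listings above this concatenation shows up as the single `mov v0.d[1], v1.d[0]`; the other instructions in the shim disassembly are the part in question.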

gnzlbg, Jul 31 '18 16:07

It's inlined into vcombine_f32_shim, right? So isn't the rest of the code from that? What does the source code for vcombine_f32_shim look like?

parched, Jul 31 '18 21:07

@parched the source is here: https://github.com/gnzlbg/stdsimd/blob/table_lookup/coresimd/aarch64/neon.rs#L626

> It's inlined into vcombine_f32_shim, right? So isn't the rest of the code from that?

Might be. If they are inlined into the shim, the code should be that of the shim.

gnzlbg, Jul 31 '18 21:07

@gnzlbg I meant the actual shim function which I see is generated here. What does that expand to? Or, alternatively, what does the IR look like?

parched, Jul 31 '18 22:07

That would be this here: https://github.com/gnzlbg/stdsimd/blob/b699bef2cb285089f5f1f2c9f3305f6caab0833e/crates/assert-instr-macro/src/lib.rs#L111

It is basically a non-#[inline] function that just has the same arguments as the intrinsic, calls the intrinsic with them, and returns its result.
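As a rough illustration only (the real shim is generated by the macro linked above, and its exact form may differ), such a shim might look like the sketch below. The names `combine_model`, `combine_shim`, and `DONT_DEDUP`, and the choice of a `&'static str` marker, are assumptions made for this example.

```rust
/// Hypothetical stand-in for the intrinsic under test.
#[inline]
fn combine_model(low: [f32; 2], high: [f32; 2]) -> [f32; 4] {
    [low[0], low[1], high[0], high[1]]
}

/// Assumed shape of the marker: a string slice is a (pointer, length) pair,
/// so assigning to it stores two machine words.
static mut DONT_DEDUP: &str = "";

/// Non-inlined wrapper with the same arguments and return type as the
/// intrinsic, so its disassembly can be inspected in isolation.
#[inline(never)]
fn combine_shim(low: [f32; 2], high: [f32; 2]) -> [f32; 4] {
    // Writing a distinct string per shim keeps otherwise identical shims
    // from being merged into one function.
    unsafe {
        DONT_DEDUP = "combine_shim";
    }
    combine_model(low, high)
}
```

Under these assumptions, the marker assignment accounts for any instructions in the shim beyond the intrinsic's own code and the `ret`.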

gnzlbg, Jul 31 '18 22:07

Ah, OK, I see, thanks. The other instructions look to be just the setting of _DONT_DEDUP, ~although I'm not sure why it's storing x10 too.~ Oh, that's just the length.

parched, Jul 31 '18 22:07

rustc now generates the same instructions as clang: https://rust.godbolt.org/z/xYKd97cW6

Nugine, Nov 21 '22 17:11

Great!

Amanieu, Nov 21 '22 22:11