
optimized widening/narrowing operations

Open peabody-korg opened this issue 7 years ago • 3 comments

Request: support for what is effectively a float32<2> <-> float64<2> conversion, operating only on the 2 lower lanes of a float32<4>. Currently the narrowest conversion operation is float32<4> <-> float64<4>, which emits extra instructions on SSE and NEON when all you want is the 2 lower lanes.
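To illustrate where the extra instructions come from, here is a minimal sketch using raw SSE2 intrinsics rather than the libsimdpp API. The helper names `widen_low2` and `widen_all4` are hypothetical and exist only for this illustration; a scalar fallback is included so the snippet builds on other targets. On SSE2, `_mm_cvtps_pd` converts only the two lower float lanes, so a full 4-lane widen needs an additional shuffle and a second conversion.

```cpp
#include <array>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of libsimdpp): widen only lanes 0 and 1.
inline std::array<double, 2> widen_low2(const float* src4)
{
#if defined(__SSE2__)
    __m128 v = _mm_loadu_ps(src4);
    __m128d lo = _mm_cvtps_pd(v);            // converts the two lower lanes only
    std::array<double, 2> out;
    _mm_storeu_pd(out.data(), lo);
    return out;
#else
    return { double(src4[0]), double(src4[1]) };
#endif
}

// Hypothetical helper (not part of libsimdpp): widen all four lanes.
inline std::array<double, 4> widen_all4(const float* src4)
{
#if defined(__SSE2__)
    __m128 v = _mm_loadu_ps(src4);
    __m128d lo = _mm_cvtps_pd(v);                    // lanes 0 and 1
    __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(v, v));  // move lanes 2,3 down, convert again
    std::array<double, 4> out;
    _mm_storeu_pd(out.data(), lo);
    _mm_storeu_pd(out.data() + 2, hi);
    return out;
#else
    return { double(src4[0]), double(src4[1]), double(src4[2]), double(src4[3]) };
#endif
}
```

If only `widen_low2` is needed, going through the 4-lane path pays for the shuffle and the second conversion and then discards their result.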

peabody-korg • Dec 01 '17 20:12

The proper solution for this issue would be to implement float32<2> support, which requires quite a lot of changes. As a workaround, a function that just converts the lower lanes of a float32<4> to a float64<2> could be written, but it would make the public API inconsistent. If included in the library, the function should go into a simdpp::unsupported or similar namespace :-)

p12tic • Dec 02 '17 12:12

Opened issue #93 for small vector support.

p12tic • Dec 02 '17 12:12

Small vector support would be useful for all kinds of things! No rush though. I have a VectorUtil namespace full of my own extensions. Here's what I have for this particular case:

namespace VectorUtil {
	// return { x[0], x[1] }
	inline simdpp::float64<2> HalfVectorToFloat64(simdpp::float32<4> x)
	{
	#if SIMDPP_USE_SSE2		// also catches AVX
		return _mm_cvtps_pd(x.native());
	#elif SIMDPP_USE_NEON64
		return vcvt_f64_f32(vget_low_f32(x.native()));
	#else
		// generic fallback: widen all four lanes, then keep only the lower half
		simdpp::float64<4> u = simdpp::to_float64(x);
		simdpp::float64<2> r, dummy;
		simdpp::split(u, r, dummy);
		return r;
	#endif
	}

	// return { x[0], x[1], -, - }
	inline simdpp::float32<4> HalfVectorToFloat32(simdpp::float64<2> x)
	{
	#if SIMDPP_USE_SSE2		// also catches AVX
		return _mm_cvtpd_ps(x.native());
//	#elif SIMDPP_USE_NEON64
		// unable to find an A64 intrinsic version that looks any better than the default
	#else
		return simdpp::to_float32(simdpp::combine(x,x));
	#endif
	}
}

It's sufficient for our project's needs, and has been tested on clang (Intel) and gcc (Intel, A64). So I don't actually need anything added to the library for this for now.

Regarding the NEON64 HalfVectorToFloat32() case, I just gave up after a while. I'm not that skilled at NEON intrinsics yet. Seems like there ought to be something equivalent to SSE2 _mm_cvtpd_ps(), but the conversions between float32x2_t and float32x4_t were thwarting any attempt at optimization. Although I managed to get gcc to emit a single 2-lane narrowing instruction, it was surrounded by what looked like a lot of unnecessary move instructions.
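For reference, one possible shape of the A64 narrowing path is sketched below. It is not taken from libsimdpp and has not been checked against the compilers mentioned in this thread. It uses the A64 intrinsic `vcvt_f32_f64` (a 2-lane narrowing conversion) followed by `vcombine_f32` with a zero upper half. The helper name `narrow2_to_low_half` is hypothetical, and plain arrays are used instead of simdpp types so the snippet stands alone. Unlike the `combine(x,x)` fallback, this sketch zeroes the upper two lanes, which matches what `_mm_cvtpd_ps` does on SSE2. Whether a given compiler folds the combine into the conversion or still emits extra moves is left open here.

```cpp
#include <array>
#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper (not part of libsimdpp):
// narrows two doubles and returns { x[0], x[1], 0, 0 }.
inline std::array<float, 4> narrow2_to_low_half(const double* src2)
{
    std::array<float, 4> out{};
#if defined(__aarch64__)
    float64x2_t x = vld1q_f64(src2);
    float32x2_t lo = vcvt_f32_f64(x);                     // 2-lane narrowing conversion
    float32x4_t r = vcombine_f32(lo, vdup_n_f32(0.0f));   // zero the upper half
    vst1q_f32(out.data(), r);
#elif defined(__SSE2__)
    __m128 r = _mm_cvtpd_ps(_mm_loadu_pd(src2));          // upper two lanes come back as zero
    _mm_storeu_ps(out.data(), r);
#else
    out[0] = static_cast<float>(src2[0]);                 // scalar fallback
    out[1] = static_cast<float>(src2[1]);
#endif
    return out;
}
```

Inside a libsimdpp helper, the `float32x4_t` result would presumably be returned directly as a `simdpp::float32<4>`, in the same way the SSE2 branches of the functions above return native vectors.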

peabody-korg • Dec 02 '17 16:12