BasicBitmap _mm_maskmoveu_si128 is slow on AMD

maskmovdqu, reciprocal throughput (clock cycles):

    K8          - 26
    K10         - 24
    Bulldozer   - 61
    Piledriver  - 92
    Steamroller - 31
    Bobcat      - 260-2300 ... Dangerous!!!
    Jaguar      - 34 ( latency 43-2210 )

IMO it'd be faster to use some combination of logic operations for TransparentBlit_x (s = source, d = destination, p = mask that is all ones where the source matches the color key):

    (( p & d ) | ( ~p & s ))     // intuitive way
    (( s | p ) ^ ( ~d & p ))     // dst can be ready later
    (( ~s & ~p ) ^ ( ~p | d ))   // extra neg
    ((( d ^ s ) & p ) ^ s )      // no neg
    ((( d ^ s ) & ~p ) ^ d )     // etc.
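A quick way to convince yourself that the five forms above are interchangeable is to check them against each other. The sketch below is only an illustration: the function names are made up for this example, and it works on plain bytes rather than SSE registers. Because every operation is bitwise, checking all eight combinations of a single bit of s, d and p covers every case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. Each function returns d where
   the mask p is set and s where it is clear. */
uint8_t blend_and_or(uint8_t s, uint8_t d, uint8_t p) { return (uint8_t)((p & d) | (~p & s)); }
uint8_t blend_or_xor(uint8_t s, uint8_t d, uint8_t p) { return (uint8_t)((s | p) ^ (~d & p)); }
uint8_t blend_neg(uint8_t s, uint8_t d, uint8_t p)    { return (uint8_t)((~s & ~p) ^ (~p | d)); }
uint8_t blend_xor_s(uint8_t s, uint8_t d, uint8_t p)  { return (uint8_t)(((d ^ s) & p) ^ s); }
uint8_t blend_xor_d(uint8_t s, uint8_t d, uint8_t p)  { return (uint8_t)(((d ^ s) & ~p) ^ d); }

/* Asserts that all five forms agree for every combination of one s, d, p bit. */
void verify_blend_forms(void)
{
    for (unsigned s = 0; s < 2; s++)
        for (unsigned d = 0; d < 2; d++)
            for (unsigned p = 0; p < 2; p++) {
                /* Keep only bit 0 of each result. */
                unsigned ref = blend_and_or((uint8_t)s, (uint8_t)d, (uint8_t)p) & 1u;
                assert(ref == (p ? d : s));
                assert((blend_or_xor((uint8_t)s, (uint8_t)d, (uint8_t)p) & 1u) == ref);
                assert((blend_neg((uint8_t)s, (uint8_t)d, (uint8_t)p) & 1u) == ref);
                assert((blend_xor_s((uint8_t)s, (uint8_t)d, (uint8_t)p) & 1u) == ref);
                assert((blend_xor_d((uint8_t)s, (uint8_t)d, (uint8_t)p) & 1u) == ref);
            }
}
```

Calling verify_blend_forms() once (in a debug build, so that assert is active) is enough; it says nothing about which form is fastest, only that they compute the same result.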

so something like:

    s = _mm_loadu_si128( (const __m128i*)&src[x] );
    d = _mm_loadu_si128( (const __m128i*)&dst[x] );
    p = _mm_cmpeq_epi8( ck, s ); // generate mask
    r = _mm_xor_si128( _mm_or_si128( s, p ), _mm_andnot_si128( d, p ) ); // blend
    _mm_storeu_si128( (__m128i*)&dst[x], r );
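For anyone who wants to time this, here is a minimal sketch that wraps the load / compare / blend / store sequence into a complete row function. It assumes 8-bit pixels (as the use of _mm_cmpeq_epi8 suggests) and non-overlapping rows; the function name and signature are hypothetical and not taken from BasicBitmap.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n 8-bit pixels from src to dst, leaving dst
   untouched wherever the source pixel equals the color key. */
void transparent_blit_row_logic(uint8_t *dst, const uint8_t *src, size_t n, uint8_t key)
{
    assert(dst != NULL && src != NULL);
    const __m128i ck = _mm_set1_epi8((char)key);
    size_t x = 0;

    for (; x + 16 <= n; x += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)&src[x]);
        __m128i d = _mm_loadu_si128((const __m128i *)&dst[x]);
        __m128i p = _mm_cmpeq_epi8(ck, s);              /* 0xFF where src == key */
        __m128i r = _mm_xor_si128(_mm_or_si128(s, p),   /* (s | p) ^ (~d & p)    */
                                  _mm_andnot_si128(d, p));
        _mm_storeu_si128((__m128i *)&dst[x], r);
    }

    /* Scalar tail for the last n % 16 pixels. */
    for (; x < n; x++)
        if (src[x] != key)
            dst[x] = src[x];
}
```

Note that every 16-byte block of dst is both read and written here, even when all of its pixels are transparent.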

May 30 '16 22:05 aqrit

Thanks. PCMPEQ with bit operations was a well-known way to perform transparent blitting in the early MMX years (since MMX doesn't have a maskmov instruction).

It performed well on old machines but performs very badly on modern hardware. On my old desktop it is a little faster than maskmov when both the src and dst bitmaps are smaller than the L2 cache, or when both are already in the L2 cache (small bitmaps, e.g. repeatedly blitting the same bitmap onto another). But performance drops distinctly when L2 cache misses occur, because it requires a loadu_si128 from the destination.

In normal usage, nobody blits to the same bitmap 1000 times; the typical case is blitting from an arbitrary source to an arbitrary destination.

Maybe the hardware implementation of maskmov also contains a memory-fetch micro-operation, but doing it in hardware is better than doing it in software. Since maskmov has been a widely used SSE2 instruction for more than 10 years and is much faster than bit operations with pcmpeq on my computer (Intel), I chose to use maskmov instead of bit operations.
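For comparison, a maskmov-based row loop might look roughly like the sketch below. This is an assumption about the general shape of such code, not the actual BasicBitmap implementation; the function name and the 8-bit pixel format are hypothetical. _mm_maskmoveu_si128 writes only the bytes whose mask byte has its top bit set, so the mask has to be the inverse of the color-key comparison, and the code never loads from dst itself.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: same contract as a color-key row blit, but using
   maskmovdqu so that dst is never explicitly loaded. */
void transparent_blit_row_maskmov(uint8_t *dst, const uint8_t *src, size_t n, uint8_t key)
{
    assert(dst != NULL && src != NULL);
    const __m128i ck = _mm_set1_epi8((char)key);
    const __m128i ones = _mm_set1_epi8(-1);
    size_t x = 0;

    for (; x + 16 <= n; x += 16) {
        __m128i s = _mm_loadu_si128((const __m128i *)&src[x]);
        /* 0xFF where src != key, i.e. where the pixel is opaque. */
        __m128i opaque = _mm_xor_si128(_mm_cmpeq_epi8(ck, s), ones);
        _mm_maskmoveu_si128(s, opaque, (char *)&dst[x]);
    }
    _mm_sfence();  /* maskmovdqu stores carry a non-temporal hint */

    /* Scalar tail for the last n % 16 pixels. */
    for (; x < n; x++)
        if (src[x] != key)
            dst[x] = src[x];
}
```

Timing this against a load/blend/store loop on bitmaps both smaller and larger than the L2 cache would show where each approach wins on a given CPU.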

Could you please provide your test code? I will test it if I get an AMD desktop. It may be a good idea to add a hardware whitelist that selects bit operations on some AMD CPUs.
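One possible shape for such a whitelist is sketched below. It is a sketch under assumptions: it relies on the GCC/Clang header <cpuid.h> (MSVC would need __cpuid from <intrin.h> instead), all function names are hypothetical, and it matches on the vendor string only. A real list would probably also need to check family and model, since the numbers above differ a lot between AMD cores.

```c
#include <assert.h>
#include <cpuid.h>   /* GCC/Clang only: __get_cpuid */
#include <string.h>

/* Reads the 12-character vendor string from CPUID leaf 0 into vendor[13].
   Returns 1 on success, 0 if CPUID is not available. */
int cpu_vendor(char vendor[13])
{
    unsigned int eax, ebx, ecx, edx;
    vendor[0] = '\0';
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(vendor + 0, &ebx, 4);   /* vendor string order is EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return 1;
}

/* Hypothetical policy: prefer the logic-op blit on any AMD CPU. */
int vendor_prefers_logic_blit(const char *vendor)
{
    assert(vendor != NULL);
    return strcmp(vendor, "AuthenticAMD") == 0;
}

/* Intended to be called once at startup to pick the blit routine. */
int use_logic_blit(void)
{
    char vendor[13];
    return cpu_vendor(vendor) && vendor_prefers_logic_blit(vendor);
}
```

The result of use_logic_blit() could be cached in a function pointer so that the check is not repeated on every blit.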

May 31 '16 05:05 skywind3000

I have not done a benchmark. I have an old K8 [Athlon 64 X2 (Windsor)], if you'd like me to time some code.

I saw that maskmovdqu had a performance issue here: https://www.cendio.com/bugzilla/show_bug.cgi?id=4328

I started searching to see how others implemented TransparentBlt… found your project and wished to hear your thoughts.

Thank you for your response :-)

edit: according to IACA, the block throughput for 64 bytes using aligned accesses + logic instructions on Nehalem is 10.00 cycles, but as you said, cache misses might dominate.

May 31 '16 16:05 aqrit