Simd Alpha blend small BGRA image onto large YUV420P image

Been struggling with this simple idea all weekend. I want a fast way to overlay a small BGRA image on a large YUV420P image. I finally figured out that the best way to do this, apart from having a dedicated function, is to first use SimdBgraToYuva444pV2 to convert the BGRA and then call SimdAlphaBlending 3 times. However, SimdBgraToYuva444pV2 is missing. A less elegant way would be to use SimdBgraToYuva420p and two alpha masks, but I am already foreseeing mask errors. Is it possible to add SimdBgraToYuva444pV2 (or alternatively extend SimdAlphaBlending)?

Jun 05 '22 09:06 mikeversteeg

If the BGRA image is small, then the overhead of calling SimdBgraToYuv420pV2 and SimdDeinterleave is not too significant, is it?

P.S. I will add SimdBgraToYuva444pV2 to the road map.

Jun 06 '22 08:06 ermig1979

Correct me if I'm wrong, but I can't seem to use SimdAlphaBlending between two 420 bitmaps because the alpha plane is bigger than the U and V planes, right? As I said, working with two different alpha masks in the same image is bound to introduce artefacts.

Jun 06 '22 08:06 mikeversteeg

A mask reduced by 2x2 is used for the U and V color planes. Does it really produce such awful artefacts?

Jun 06 '22 09:06 ermig1979
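
The reduced-mask approach described above can be sketched in plain scalar C++. This is only an illustration under assumptions, not the Simd implementation: the helper names (BlendSample, ReduceAlpha2x2, BlendPlane), the rounding used in the blend, and the averaging used for the 2x2 reduction are all hypothetical choices made for this sketch, and even plane dimensions are assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Blend one 8-bit sample: alpha 0 keeps dst, alpha 255 takes src.
// The "+ 127" rounding is an assumption of this sketch.
inline uint8_t BlendSample(uint8_t src, uint8_t alpha, uint8_t dst)
{
    return uint8_t((src * alpha + dst * (255 - alpha) + 127) / 255);
}

// Average every 2x2 block of a full-resolution alpha plane so that it
// matches the size of the U and V planes (even width and height assumed).
std::vector<uint8_t> ReduceAlpha2x2(const std::vector<uint8_t>& alpha, size_t width, size_t height)
{
    std::vector<uint8_t> reduced((width / 2) * (height / 2));
    for (size_t y = 0; y < height / 2; ++y)
    {
        for (size_t x = 0; x < width / 2; ++x)
        {
            const uint8_t* p = alpha.data() + 2 * y * width + 2 * x;
            reduced[y * (width / 2) + x] = uint8_t((p[0] + p[1] + p[width] + p[width + 1] + 2) / 4);
        }
    }
    return reduced;
}

// Blend one tightly packed plane; src, alpha and dst must have the same size.
void BlendPlane(const std::vector<uint8_t>& src, const std::vector<uint8_t>& alpha, std::vector<uint8_t>& dst)
{
    for (size_t i = 0; i < dst.size(); ++i)
        dst[i] = BlendSample(src[i], alpha[i], dst[i]);
}
```

With these helpers the Y plane would be blended with the full-resolution alpha, and the U and V planes with the result of ReduceAlpha2x2, which gives three plane blends plus one reduction per frame.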

No, not really in practice, fortunately. Alpha blending 2 YUV420 images, which requires 3x SimdAlphaBlending and 1x SimdReduceColor2x2, is very slow though for high resolution bitmaps. Much slower than any other function I've used so far. And I need to do a lot of it (keying).

Aug 13 '22 12:08 mikeversteeg

I assume you already avoid blending of pixels that have alpha 0 and 255? This significantly speeds up blending of bitmaps that are largely transparent and only have e.g. a small logo in the top corner.

Aug 13 '22 13:08 mikeversteeg

Hi! AlphaBlending is a memory-bound operation, so checking the alpha channel for 0 or 255 costs about as much as performing the full blend. When you add a small logo or watermark to a large image, you should perform alpha blending only on the subregion it covers (use the method View::Region()).

Aug 15 '22 08:08 ermig1979
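
The subregion advice above can be illustrated with a small self-contained sketch. It deliberately does not use the Simd API: Plane is a hypothetical stand-in for a single-plane view, its Region method only mimics the idea of View::Region(), and the rounding in BlendOverlay is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal view of one 8-bit plane (not the Simd::View type).
struct Plane
{
    uint8_t* data;
    size_t width, height, stride;

    // Sub-view that shares the same memory. The caller must guarantee
    // that the rectangle lies inside the plane.
    Plane Region(size_t left, size_t top, size_t w, size_t h) const
    {
        return Plane{ data + top * stride + left, w, h, stride };
    }
};

// Blend a small overlay onto dst. src, alpha and dst must have the same
// width and height; only dst.width * dst.height samples are touched.
void BlendOverlay(const Plane& src, const Plane& alpha, Plane dst)
{
    for (size_t y = 0; y < dst.height; ++y)
    {
        const uint8_t* s = src.data + y * src.stride;
        const uint8_t* a = alpha.data + y * alpha.stride;
        uint8_t* d = dst.data + y * dst.stride;
        for (size_t x = 0; x < dst.width; ++x)
            d[x] = uint8_t((s[x] * a[x] + d[x] * (255 - a[x]) + 127) / 255);
    }
}
```

Calling BlendOverlay(logo, mask, frame.Region(left, top, logo.width, logo.height)) then reads and writes only the pixels under the logo, so the cost scales with the logo size rather than with the frame size.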

Understood. Thanks.

Aug 15 '22 09:08 mikeversteeg

How about multi-layered blending, i.e. one pass to blend multiple images, to save on write cycles?

Aug 15 '22 15:08 mikeversteeg

Multi-layered blending will be faster than performing single blends sequentially because it saves memory throughput. But in the general case a multi-layered blending algorithm is much more difficult than a single-layer one. How many layers do you want to blend?

Aug 15 '22 15:08 ermig1979
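
To make the throughput argument concrete, here is a minimal scalar sketch of two-layer blending done sequentially and in one fused pass. The function names (Mix, BlendSequential, BlendTwoLayers) and the rounding are assumptions of this sketch and say nothing about how Simd implements it.

```cpp
#include <cstddef>
#include <cstdint>

// Blend one 8-bit sample: alpha 0 keeps dst, alpha 255 takes src.
inline uint8_t Mix(uint8_t src, uint8_t alpha, uint8_t dst)
{
    return uint8_t((src * alpha + dst * (255 - alpha) + 127) / 255);
}

// Sequential version: dst is loaded and stored once per layer.
void BlendSequential(const uint8_t* src0, const uint8_t* alpha0,
                     const uint8_t* src1, const uint8_t* alpha1,
                     uint8_t* dst, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = Mix(src0[i], alpha0[i], dst[i]);
    for (size_t i = 0; i < size; ++i)
        dst[i] = Mix(src1[i], alpha1[i], dst[i]);
}

// Fused version: dst is loaded once and stored once, the intermediate
// result stays in a register.
void BlendTwoLayers(const uint8_t* src0, const uint8_t* alpha0,
                    const uint8_t* src1, const uint8_t* alpha1,
                    uint8_t* dst, size_t size)
{
    for (size_t i = 0; i < size; ++i)
    {
        uint8_t lower = Mix(src0[i], alpha0[i], dst[i]);
        dst[i] = Mix(src1[i], alpha1[i], lower);
    }
}
```

Both functions produce the same output; the fused one simply avoids the second load and store of dst, which is where any gain for a memory-bound operation would come from.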

I would be over the moon if you'd give me 2 now.

Aug 15 '22 16:08 mikeversteeg

As one possible solution, you can process frames in parts (blocks of rows). The size of each part is chosen so that it fits into the L1 or L2 cache.

Aug 16 '22 06:08 ermig1979

I tried this for L2 and it does not improve the speed, in fact it makes the blending slower. Possibly because of the overhead and because the alpha falls outside the cache causing a miss? I can see how non-planar bitmaps would be beneficial in this situation.

Aug 16 '22 13:08 mikeversteeg

Did you implement something like this:

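for (size_t row = 0; row < dst.height; row += step)
{
    // Block of rows; the last block is clamped so it never runs past the image.
    Rect reg(0, row, dst.width, std::min(row + step, dst.height));
    // Blend both layers into the same block while it is still in cache.
    Simd::AlphaBlending(src1.Region(reg), alpha1.Region(reg), dst.Region(reg).Ref());
    Simd::AlphaBlending(src2.Region(reg), alpha2.Region(reg), dst.Region(reg).Ref());
}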

?

Aug 16 '22 13:08 ermig1979

Yes. But keep in mind alpha falls outside the cache..

Aug 16 '22 13:08 mikeversteeg

I am going to perform some tests to clarify this situation.

Aug 16 '22 14:08 ermig1979

Hi! Sorry for the delayed answer, I was busy with other work. I thought about this issue (double alpha blending), but I have some doubts about how effective this solution would be. The bottleneck of this function (AlphaBlending) is memory throughput. For BGR the memory traffic per pixel is as follows. Two AlphaBlending calls: load (SRC(3) + ALP(1) + DST(3))*2 = 14, save DST(3)*2 = 6. One AlphaBlending2x call: load (SRC(3) + ALP(1))*2 + DST(3) = 11, save DST(3) = 3. So the performance gain is bounded between (14 - 11)/11 = 27% and (6 - 3)/3 = 100%. The negative test result above also worries me.

Aug 24 '22 09:08 ermig1979
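
The budget above can be reproduced with a few lines of arithmetic. This is just a worked example of the numbers in the post; the names Traffic, Sequential, Fused and GainPercent are made up for the illustration, and it assumes one byte per channel and one byte of alpha per pixel.

```cpp
// Bytes moved per pixel for a given blending scheme.
struct Traffic
{
    int load;
    int store;
};

// Separate passes: each pass loads src + alpha + dst and stores dst.
constexpr Traffic Sequential(int channels, int layers)
{
    return Traffic{ (channels + 1 + channels) * layers, channels * layers };
}

// One fused pass: each layer loads src + alpha, dst is loaded and stored once.
constexpr Traffic Fused(int channels, int layers)
{
    return Traffic{ (channels + 1) * layers + channels, channels };
}

// Relative gain in percent when traffic drops from 'before' to 'after'.
constexpr double GainPercent(int before, int after)
{
    return 100.0 * (before - after) / after;
}
```

For BGR with two layers, Sequential(3, 2) gives 14 bytes loaded and 6 stored, and Fused(3, 2) gives 11 loaded and 3 stored, matching the figures in the post. As an extra data point that is not from the thread: if loads and stores were weighted equally, the total would drop from 20 to 14 bytes per pixel, a gain of roughly 43%, which sits between the two bounds.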

Memory throughput is definitely my biggest challenge at this point. You can get as many cores (and caches) as you like on e.g. AWS EC2, but memory speed is unfortunately hard limited. I managed to squeeze a bit more performance out of e.g. scaling and alpha blending by using separate threads for Y and UV, but when serialising operations the overall latency reaches the frame time (20 ms) making this impossible. I'm currently using UHD 420p btw.

Aug 24 '22 09:08 mikeversteeg

I added the function AlphaBlending2x (base version, optimizations will come later). Could you check its correctness?

Aug 24 '22 10:08 ermig1979

Will do. Builds fine btw.

Aug 24 '22 12:08 mikeversteeg

Well, it is working, but performance is not good. I do not know what you mean by base (AVX?), but as it is now there is no improvement at all.

Aug 26 '22 12:08 mikeversteeg

I added AVX2 optimizations yesterday.

Aug 26 '22 22:08 ermig1979

I guess I tested an old version then. It is a lot of work to run a test on the production setup, and now even more difficult as productions have started, so it may take a while before I can test again. Meanwhile I did start to wonder if we would achieve a similar gain by changing this function not to blend two images at once, but to blend the Y and UV parts at once. Also two images, but not of the same size. That would make the function much easier to use, and it would fit nicely with all the other functions. SimdAlphaBlendingYuva420p?

Aug 30 '22 09:08 mikeversteeg

I added the function SimdAlphaBlendingBgraToYuv420p.

Jan 27 '23 09:01 ermig1979
