Simd
Alpha blend small BGRA image onto large YUV420P image
Been struggling with this simple idea all weekend. I want a fast way to overlay a small BGRA image on a large YUV420P image. Finally figured out the best way to do this, apart from having a dedicated function, is to first use SimdBgraToYuva444pV2 to convert the BGRA and then call SimdAlphaBlending 3 times. However, SimdBgraToYuva444pV2 is missing. A less elegant way would be to use SimdBgraToYuva420p and use two alpha masks, but I am already foreseeing mask errors.
Is it possible to add SimdBgraToYuva444pV2 (or alternatively extend SimdAlphaBlending)?
If the BGRA image is small, then the overhead of calling SimdBgraToYuv420pV2 and SimdDeinterleave is not too significant, is it?
P.S. I will add SimdBgraToYuva444pV2 to the roadmap.
Correct me if I'm wrong, but I can't seem to use SimdAlphaBlending between two 420 bitmaps because the alpha plane is bigger than the U and V planes, right? As I said, working with two different alpha masks in the same image is bound to introduce artefacts.
A mask reduced by 2x2 is used for the U and V color planes. Does it really produce such awful artefacts?
No, not really in practice fortunately. Alpha blending 2 YUV420 images, which requires 3 calls to SimdAlphaBlending and 1 call to SimdReduceColor2x2, is very slow though for high resolution bitmaps. Much slower than any other function I've used so far. And I need to do a lot of it (keying).
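For readers following along, the three blends plus one mask reduction described above can be sketched in plain scalar C++. This is only an illustration of the data flow, not the library's implementation: `Plane8`, `Yuv420Image`, `BlendPlane`, `Reduce2x2` and `BlendYuv420` are hypothetical names, the rounding is an assumption, and the image is assumed to have even width and height.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tightly packed 8-bit plane (no row padding).
struct Plane8
{
    std::size_t width, height;
    std::vector<std::uint8_t> data;
    Plane8(std::size_t w, std::size_t h, std::uint8_t value = 0)
        : width(w), height(h), data(w * h, value) {}
    std::uint8_t& At(std::size_t x, std::size_t y) { return data[y * width + x]; }
    std::uint8_t At(std::size_t x, std::size_t y) const { return data[y * width + x]; }
};

// dst = (src * alpha + dst * (255 - alpha)) / 255, rounded to nearest.
// All three planes must have the same size.
inline void BlendPlane(const Plane8& src, const Plane8& alpha, Plane8& dst)
{
    for (std::size_t i = 0; i < dst.data.size(); ++i)
    {
        int a = alpha.data[i];
        dst.data[i] = std::uint8_t((src.data[i] * a + dst.data[i] * (255 - a) + 127) / 255);
    }
}

// Averages every 2x2 block, giving a half-size mask for the U and V planes.
inline Plane8 Reduce2x2(const Plane8& src)
{
    Plane8 out(src.width / 2, src.height / 2);
    for (std::size_t y = 0; y < out.height; ++y)
        for (std::size_t x = 0; x < out.width; ++x)
            out.At(x, y) = std::uint8_t((src.At(2 * x, 2 * y) + src.At(2 * x + 1, 2 * y) +
                src.At(2 * x, 2 * y + 1) + src.At(2 * x + 1, 2 * y + 1) + 2) / 4);
    return out;
}

// Hypothetical YUV420P image: full-size Y, half-size U and V.
struct Yuv420Image
{
    Plane8 y, u, v;
    Yuv420Image(std::size_t w, std::size_t h, std::uint8_t yy, std::uint8_t uu, std::uint8_t vv)
        : y(w, h, yy), u(w / 2, h / 2, uu), v(w / 2, h / 2, vv) {}
};

// Three blends plus one mask reduction: the full mask drives Y, the reduced mask drives U and V.
inline void BlendYuv420(const Yuv420Image& src, const Plane8& alpha, Yuv420Image& dst)
{
    Plane8 half = Reduce2x2(alpha);
    BlendPlane(src.y, alpha, dst.y);
    BlendPlane(src.u, half, dst.u);
    BlendPlane(src.v, half, dst.v);
}
```

Every call walks a whole plane, so the cost is dominated by how many bytes are read and written rather than by the arithmetic.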
I assume you already avoid blending of pixels that have alpha 0 and 255? This significantly speeds up blending of bitmaps that are largely transparent and only have e.g. a small logo in the top corner.
Hi! AlphaBlending is a memory-bound operation, so checking the alpha channel for 0 or 255 costs about as much as performing the full blend. When you add a small logo or watermark to a large image, you should perform the alpha blending on a subregion only (use the method View::Region()).
Understood. Thanks.
How about multi-layered blending, i.e. one pass to blend multiple images, to save on write cycles?
Multi-layered blending will be faster than running single blends sequentially because it saves memory throughput. But in the general case a multi-layered blending algorithm is much more difficult than a single-layer one. How many layers do you want to blend?
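To make the saving concrete, here is a minimal scalar sketch of a fused two-layer blend. `BlendPixel` and `BlendTwoLayers` are hypothetical names and the rounding is an assumption; the point is only that the intermediate result stays in a local variable, so the destination is loaded once and stored once per pixel instead of twice.

```cpp
#include <cstddef>
#include <cstdint>

// Single-pixel blend: (src * alpha + dst * (255 - alpha)) / 255, rounded to nearest.
inline std::uint8_t BlendPixel(int src, int alpha, int dst)
{
    return std::uint8_t((src * alpha + dst * (255 - alpha) + 127) / 255);
}

// Blends layer 0 and then layer 1 over dst in one pass over memory.
inline void BlendTwoLayers(const std::uint8_t* src0, const std::uint8_t* alpha0,
    const std::uint8_t* src1, const std::uint8_t* alpha1,
    std::uint8_t* dst, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
    {
        // The intermediate value is never written back to memory.
        std::uint8_t tmp = BlendPixel(src0[i], alpha0[i], dst[i]);
        dst[i] = BlendPixel(src1[i], alpha1[i], tmp);
    }
}
```

Because each stage uses the same rounding as a standalone blend, the output matches two sequential passes exactly.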
I would be over the moon if you'd give me 2 now.
As one possible solution, you can process frames in parts (blocks of rows). The size of each part is chosen so that it fits into the L1 or L2 cache.
I tried this for L2 and it does not improve the speed, in fact it makes the blending slower. Possibly because of the overhead and because the alpha falls outside the cache causing a miss? I can see how non-planar bitmaps would be beneficial in this situation.
Did you implement something like this:
for(size_t row = 0; row < dst.height; row += step)
{
    // Blend both layers into the same block of rows while it is still in cache.
    Rect reg(0, row, dst.width, row + step);
    Simd::AlphaBlending(src1.Region(reg), alpha1.Region(reg), dst.Region(reg).Ref());
    Simd::AlphaBlending(src2.Region(reg), alpha2.Region(reg), dst.Region(reg).Ref());
}
?
Yes. But keep in mind alpha falls outside the cache..
I am going to run some tests to clarify this situation.
Hi! Sorry for the delayed answer, I was busy with other activities. I have thought about this issue (double alpha blending), but I have some doubts about the effectiveness of this solution. This function (AlphaBlending) has a bottleneck: memory throughput. For BGR, the memory throughput budget in bytes per pixel is:
2 x AlphaBlending: Load = (SRC(3) + ALP(1) + DST(3))*2 = 14, Save = DST(3)*2 = 6;
AlphaBlending2x: Load = (SRC(3) + ALP(1))*2 + DST(3) = 11, Save = DST(3) = 3.
So the performance gain is limited to somewhere between (14 - 11)/11 = 27% and (6 - 3)/3 = 100%. The negative test result above also worries me.
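The budget above can be reproduced with a few lines of C++, which also makes it easy to plug in other channel or layer counts. The names `Traffic`, `Sequential`, `Fused` and `GainPercent` are hypothetical, and the model only counts bytes per pixel (one alpha byte per layer); it ignores caches entirely.

```cpp
// Bytes loaded and stored per pixel.
struct Traffic { int load; int save; };

// N separate single-layer blends: each one loads src, alpha and dst, and stores dst.
constexpr Traffic Sequential(int layers, int channels)
{
    return { layers * (channels + 1 + channels), layers * channels };
}

// One fused pass: each layer loads src and alpha; dst is loaded once and stored once.
constexpr Traffic Fused(int layers, int channels)
{
    return { layers * (channels + 1) + channels, channels };
}

// Relative gain in percent when traffic drops from 'before' to 'after'.
constexpr double GainPercent(int before, int after)
{
    return 100.0 * (before - after) / after;
}
```

With 2 layers and 3 channels, `Sequential` gives 14 loaded and 6 stored bytes, `Fused` gives 11 and 3, and `GainPercent` returns about 27 for the loads and 100 for the stores.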
Memory throughput is definitely my biggest challenge at this point. You can get as many cores (and caches) as you like on e.g. AWS EC2, but memory speed is unfortunately hard limited. I managed to squeeze a bit more performance out of e.g. scaling and alpha blending by using separate threads for Y and UV, but when serialising operations the overall latency reaches the frame time (20 ms) making this impossible. I'm currently using UHD 420p btw.
I added the function AlphaBlending2x (base version, optimizations will come later). Could you check its correctness?
Will do. Builds fine btw.
Well it is working, but performance is not good. I do not know what you mean with base (AVX?) but as it is now there is no improvement at all.
I added AVX2 optimizations yesterday.
I guess I tested an old version then. It is a lot of work to run a test on the production setup, and it is even more difficult now that productions have started, so it may take a while before I can test again. Meanwhile I did start to wonder whether we would achieve a similar gain by changing this function so that it does not blend two images at once, but blends the Y and UV parts at once. That is also two images, just not of the same size. That would make the function much easier to use, and it would fit nicely with all other functions. SimdAlphaBlendingYuva420p?
I added the function SimdAlphaBlendingBgraToYuv420p.