
[Portable Pipeline] Composition operators not working (only SrcCopy and SrcOver)

Open HowToExpect opened this issue 2 years ago • 11 comments

```cpp
BLImage img(480, 480, BL_FORMAT_PRGB32);
BLContext ctx(img);

ctx.clearAll();

// First shape filled with a radial gradient.
// By default, SRC_OVER composition is used.
BLGradient radial(
  BLRadialGradientValues(180, 180, 180, 180, 180));
radial.addStop(0.0, BLRgba32(0xFFFFFFFF));
radial.addStop(1.0, BLRgba32(0xFFFF6F3F));
ctx.fillCircle(180, 180, 160, radial);

// Second shape filled with a linear gradient.
BLGradient linear(
  BLLinearGradientValues(195, 195, 470, 470));
linear.addStop(0.0, BLRgba32(0xFFFFFFFF));
linear.addStop(1.0, BLRgba32(0xFF3F9FFF));

// Use 'setCompOp()' to change the composition operator.
ctx.setCompOp(BL_COMP_OP_DIFFERENCE);
ctx.fillRoundRect(
  BLRoundRect(195, 195, 270, 270, 25), linear);

ctx.end();
```

When `setCompOp(BL_COMP_OP_DIFFERENCE)` is used, the subsequent `fillRoundRect()` call has no effect.

HowToExpect avatar Nov 23 '23 03:11 HowToExpect

I'm sorry, but the portable pipeline doesn't provide all composition operators at the moment. This is still on the TODO list.
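For reference, the DIFFERENCE operator on premultiplied pixels follows the standard SVG/Porter-Duff compositing formula, Dca' = Sca + Dca − 2·min(Sca·Da, Dca·Sa). Below is a minimal scalar sketch of that formula in 8-bit fixed point; this is illustrative only, not Blend2D's internal code, and the `Prgb`/`compDifference` names are made up:

```cpp
#include <algorithm>
#include <cstdint>

// Rounding (a * b) / 255, the usual 8-bit fixed-point multiply.
static inline uint32_t mulDiv255(uint32_t a, uint32_t b) {
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

// A premultiplied RGBA pixel with 8-bit components (hypothetical type).
struct Prgb { uint32_t r, g, b, a; };

// DIFFERENCE on premultiplied components:
//   Dca' = Sca + Dca - 2 * min(Sca * Da, Dca * Sa)
//   Da'  = Sa + Da - Sa * Da
static Prgb compDifference(const Prgb& s, const Prgb& d) {
    auto ch = [&](uint32_t sc, uint32_t dc) {
        return sc + dc - 2 * std::min(mulDiv255(sc, d.a), mulDiv255(dc, s.a));
    };
    return { ch(s.r, d.r), ch(s.g, d.g), ch(s.b, d.b),
             s.a + d.a - mulDiv255(s.a, d.a) };
}
```

For example, compositing opaque white over opaque white yields black, while white over black stays white, which is the expected behavior of DIFFERENCE.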

kobalicek avatar Nov 23 '23 21:11 kobalicek

ok

HowToExpect avatar Dec 01 '23 06:12 HowToExpect

@kobalicek how could I use composition operation properly?

dongzhong avatar Feb 01 '24 20:02 dongzhong

This is something that will be solved by AArch64 JIT - I'm not investing much time into portable pipelines at the moment, the JIT seems more important and its first version will premiere very soon.

kobalicek avatar Feb 15 '24 23:02 kobalicek

@kobalicek The `ctx.clearAll()` function is not fast enough on the ARM platform. I found a function in Skia's older code that uses ARM NEON optimization, and I think it could serve as a reference for optimizing functions like `ctx.clearAll()` that initialize or fill background colors.

```c
#include <arm_neon.h>

static void memset32(uint32_t* dst, uint32_t value, int n) {
    uint32x4_t   v4  = vdupq_n_u32(value);
    uint32x4x4_t v16 = {{ v4, v4, v4, v4 }};

    while (n >= 16) {
        vst4q_u32(dst, v16);  // This swizzles, but we don't care: all lanes are the same value.
        dst += 16;
        n   -= 16;
    }
    switch (n / 4) {  // Intentional fallthrough: store as many remaining 4-pixel groups as needed.
        case 3: vst1q_u32(dst, v4); dst += 4;
        case 2: vst1q_u32(dst, v4); dst += 4;
        case 1: vst1q_u32(dst, v4); dst += 4;
    }
    if (n & 2) {
        vst1_u32(dst, vget_low_u32(v4));
        dst += 2;
    }
    if (n & 1) {
        *dst = value;
    }
}
```
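For non-NEON builds, a portable equivalent (hypothetical, not taken from Skia or Blend2D) that modern compilers usually autovectorize could look like this; it writes a doubled 64-bit value, which is endianness-safe because both halves are equal:

```cpp
#include <cstdint>
#include <cstring>

// Portable 32-bit fill: write a doubled 64-bit value, then the odd tail.
static void memset32_portable(uint32_t* dst, uint32_t value, int n) {
    const uint64_t v2 = ((uint64_t)value << 32) | value;
    while (n >= 2) {
        std::memcpy(dst, &v2, sizeof(v2));  // one 8-byte store, alignment-safe
        dst += 2;
        n   -= 2;
    }
    if (n)
        *dst = value;
}
```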

openlearnc avatar Mar 16 '24 12:03 openlearnc

I found an example that shows how to optimize alpha blending using ARM NEON, with a function that can blend 8 pixels at a time: https://github.com/tttapa/ARM-NEON-Compositor
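The per-pixel operation such a blender vectorizes is the premultiplied SRC_OVER formula, Dca' = Sca + Dca·(1 − Sa). A scalar sketch for a single 8-bit channel is shown below; the function names are illustrative and are not the linked library's API:

```cpp
#include <cstdint>

// Rounding (x * y) / 255, the usual 8-bit fixed-point multiply.
static inline uint8_t mul255(uint32_t x, uint32_t y) {
    uint32_t t = x * y + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

// Premultiplied SRC_OVER for one 8-bit component:
//   Dca' = Sca + Dca * (255 - Sa) / 255
static inline uint8_t srcOver(uint8_t sc, uint8_t sa, uint8_t dc) {
    return (uint8_t)(sc + mul255(dc, 255 - sa));
}
```

A fully opaque source replaces the destination, and a fully transparent source leaves it untouched, which makes the function easy to sanity-check.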

openlearnc avatar Mar 16 '24 12:03 openlearnc

@openlearnc I have found this approach slow on ARM; the most important thing on ARM is to align the destination and then use pair stores. At least this works great on Apple Silicon.
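The align-then-wide-store idea can be sketched portably. On AArch64 the 16-byte `memcpy` in the body below typically lowers to a single q-register store or an STP pair; this is an illustrative sketch, not Blend2D's actual filler:

```cpp
#include <cstdint>
#include <cstring>

// Sketch: scalar stores until dst is 16-byte aligned, then 16-byte blocks.
static void fill32_aligned(uint32_t* dst, uint32_t value, int n) {
    // Head: align the destination to 16 bytes.
    while (n > 0 && (reinterpret_cast<uintptr_t>(dst) & 15u) != 0) {
        *dst++ = value;
        --n;
    }
    // Body: aligned 16-byte stores (4 pixels at a time).
    uint32_t block[4] = { value, value, value, value };
    while (n >= 4) {
        std::memcpy(dst, block, 16);
        dst += 4;
        n   -= 4;
    }
    // Tail: remaining 0-3 pixels.
    while (n-- > 0)
        *dst++ = value;
}
```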

There is a branch aarch64_jit now, which can be used by people interested in testing the new AArch64 JIT - it's still a little experimental, but it's much faster than portable pipelines.

In addition, the aarch64_jit branch has an optimized filler that specializes for smaller widths as well, which makes both small and large fills a little faster especially on ARM hardware.

kobalicek avatar Mar 16 '24 13:03 kobalicek

@kobalicek Although the aarch64_jit branch supports more composition operators, testing shows that it is not as fast as the non-JIT version.

openlearnc avatar Mar 24 '24 11:03 openlearnc

@openlearnc I'm interested in a workload that performs better without JIT. I have an Apple M3 chip here and I see a 2-5x performance increase when using JIT. I optimized mostly SRC and SRC_OVER though, so other compositing operators need some optimizations first (right now they basically use the x86 strategy, which is not ideal on ARM).

kobalicek avatar Mar 24 '24 11:03 kobalicek

This is the result of my testing blend2d on an Android device, and all the compilations were done using clang.

openlearnc avatar Mar 24 '24 12:03 openlearnc

If you are doing a single-shot benchmark (calling something only once), there is a small overhead to compile each function. After that the functions are cached, but they have to be compiled the first time.
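A common way to keep that one-time compilation cost out of the numbers is a warm-up pass before timing. Here is a generic sketch with a placeholder workload; `workload()` is hypothetical and merely stands in for a render call whose first invocation pays a one-time cost:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical workload standing in for a render call.
static uint64_t workload() {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < 100000; i++)
        acc += i;
    return acc;
}

// Time with one warm-up pass so one-time costs (e.g. JIT compiling a
// pipeline the first time it is used) don't skew the average.
static double averageMs(int iterations) {
    workload();  // warm-up: the first call pays any one-time cost
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        workload();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
}
```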

Maybe Apple M3 is too powerful and has lower latencies for the instructions Blend2D prefers, but I would need more info about that. For example, Blend2D has no problem using the TBL instruction, which was slow in the past.

But still, the JIT understands how to unroll some things much better than a C++ compiler, so it's hard to believe something would be faster without it. Maybe some very tiny workloads can be better without JIT, when only a few pixels per scanline are modified, but I would like to know about these cases.

kobalicek avatar Mar 24 '24 12:03 kobalicek