
Calculate active core correctly for fusions which use shared cache

lingzhi98 opened this issue 1 year ago · 0 comments

With the current column reduction codegen, the SM active-core ratio is low when the last kept dimension is small, as in the HLO below:

```
fused_reduce {
  param_1.15 = bf16[1,2048]{1,0} parameter(1)
  bitcast.86.8 = bf16[2048]{0} bitcast(param_1.15)
  convert.90.5 = f32[2048]{0} convert(bitcast.86.8)
  broadcast.6.6 = f32[2048,256]{1,0} broadcast(convert.90.5), dimensions={0}, metadata={op_name="jit(func)/jit(main)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]"}
  param_0.29 = bf16[2048,256]{1,0} parameter(0)
  convert.83.3 = f32[2048,256]{1,0} convert(param_0.29)
  multiply.8.3 = f32[2048,256]{1,0} multiply(broadcast.6.6, convert.83.3)
  constant_9 = f32[] constant(0)
  reduce.5 = f32[256]{0} reduce(multiply.8.3, constant_9), dimensions={0}, to_apply=scalar_add_computation
  param_2.12 = bf16[2048,256]{1,0} parameter(2)
  convert.87.3.clone.1 = f32[2048,256]{1,0} convert(param_2.12)
  multiply.9.3.clone.1 = f32[2048,256]{1,0} multiply(broadcast.6.6, convert.87.3.clone.1)
  reduce.1.1.clone.1 = f32[256]{0} reduce(multiply.9.3.clone.1, constant_9), dimensions={0}, to_apply=scalar_add_computation
  ROOT tuple = (f32[256]{0}, f32[256]{0}) tuple(reduce.5, reduce.1.1.clone.1)
} // fused_reduce
```
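To make the problem concrete, here is a back-of-the-envelope sketch of the active-core ratio for this fusion. The 32-columns-per-block tile width and the per-SM mapping are assumptions for illustration; the exact tiling chosen by the column reduction emitter may differ.

```cpp
// Rough estimate (not XLA code) of why the active-core ratio is low here.
// Assumes one thread block covers a 32-column tile of the kept dimension and
// that the reduced dimension is not split across blocks.
#include <algorithm>
#include <cstdio>

int main() {
  const int kept_dim = 256;           // output size of reduce.5 in the HLO above
  const int columns_per_block = 32;   // assumed tile width
  const int num_sms = 108;            // SM count on A100 40GB

  const int num_blocks = (kept_dim + columns_per_block - 1) / columns_per_block;
  const double active_core_ratio =
      static_cast<double>(std::min(num_blocks, num_sms)) / num_sms;
  std::printf("blocks=%d, active core ratio=%.2f\n", num_blocks,
              active_core_ratio);     // ~8 blocks -> roughly 7% of SMs busy
  return 0;
}
```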

I submitted the related optimization in another PR: splitting the reduced dimension to improve the active-core ratio (a common and widely used idea). It gives a 2x performance improvement, from 20us to 10us on an A100 40GB. However, that change makes this test fail. I think this is a bug in the ComputeTime func, which is not suitable for fusions that use shared cache, such as reduction and transpose. I have moved the ComputeTime change out of the reduction PR into this one so that it is easier to review.
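For context, below is a minimal sketch of the shape such a ComputeTime-style estimate can take, scaling the effective FLOP rate by the fraction of SMs that actually have work. This is not the actual XLA cost-model code; the names, the utilization formula, and the example numbers are assumptions for illustration, and a real model would still need the shared-cache correction this issue asks for.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical ComputeTime-style estimate: total FLOPs divided by an
// effective throughput that accounts for how many SMs are occupied.
// Without such scaling, splitting the reduced dimension (more blocks,
// same total FLOPs) would not appear to change the compute time at all.
double ComputeTimeSeconds(int64_t flops, int64_t num_blocks, int num_sms,
                          double peak_flops_per_second) {
  // Fraction of SMs occupied by at least one block.
  const double core_utilization =
      static_cast<double>(std::min<int64_t>(num_blocks, num_sms)) / num_sms;
  return static_cast<double>(flops) /
         (peak_flops_per_second * core_utilization);
}

int main() {
  const int64_t flops = 2048LL * 256 * 4;  // rough FLOP count for the fusion
  const double peak = 19.5e12;             // A100 FP32 peak, for illustration
  // Unsplit column reduction: ~8 blocks. Splitting the reduced dimension into
  // 8 pieces would launch ~64 blocks and raise utilization accordingly.
  std::printf("unsplit: %.3g s, split: %.3g s\n",
              ComputeTimeSeconds(flops, 8, 108, peak),
              ComputeTimeSeconds(flops, 64, 108, peak));
  return 0;
}
```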

lingzhi98 · May 07 '24 06:05