
[XLA:GPU] Improve memory bandwidth utilization of column reduction

Open lingzhi98 opened this issue 1 year ago • 24 comments

Column reductions supported vectorization previously; that support was removed after this PR. The reason for disabling vectorization was that no performance gain was found, and the vectorization heuristic was fairly complex and different from the one for row reductions. But I find that vectorization in column reductions is really helpful for the gemma model. To re-enable it, I have modified the previous vectorization heuristic to simplify its logic. Column reduction vectorization now checks only 2 conditions: (1) whether vectorization would introduce large overhead, checked by MayPreventVectorization just like for row reductions, and (2) whether the last dimension is divisible by the vectorization factor.

Although vectorization gives better memory bandwidth, it decreases the number of active SM cores, which can in turn hurt memory bandwidth. The previous column reduction vectorization heuristic did not enable vectorization if the number of active SM cores was smaller than the maximum core count. But if we split the reduced dimension into multiple parts, we can increase the number of active SM cores and thereby improve memory bandwidth utilization. Although this requires atomics and an initializer thunk, I still measure a performance improvement compared with no vectorization. Also, when the minor kept dimension is small, the current column reduction codegen uses only a few SM cores; splitting the reduced dimension also gives better performance in that situation.

After enabling the above optimizations, gemma 2b (bs1, bfloat16, greedy search, 1024 input tokens, 128 output tokens) e2e performance shows a 1.1x speedup on an A100 40GB.
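A rough sketch of the simplified heuristic described above (everything here is illustrative rather than the actual XLA code; may_prevent_vectorization stands in for the existing MayPreventVectorization check shared with row reductions):

```cpp
#include <cstdint>

// Illustrative only: enable column reduction vectorization when (1) the
// fusion has no ops that would make vectorized access expensive and (2) the
// last (minor kept) dimension is divisible by the vectorization factor.
bool CanVectorizeColumnReduction(bool may_prevent_vectorization,
                                 int64_t last_dim_size,
                                 int64_t vectorization_factor) {
  // (1) Same overhead check as row reductions (MayPreventVectorization).
  if (may_prevent_vectorization) return false;
  // (2) Every thread's vector load must stay in bounds and aligned.
  return last_dim_size % vectorization_factor == 0;
}
```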

lingzhi98 avatar Mar 28 '24 07:03 lingzhi98

@jreiffers given that you worked on the reduction emitter quite a bit and know the latest state, can you please review this?

akuegel avatar Apr 08 '24 13:04 akuegel

Hi @jreiffers, could you take a look at this? Thanks.

kamaljeeti avatar Apr 12 '24 05:04 kamaljeeti

Thanks for your PR and sorry for the delay, I didn't see this. I just took a quick look and have a few comments:

  • We need to make sure this is modeled properly in the cost model. For this, the ReductionInfo::ComputeThreadIdToInputIndexing function in reduction_base.cc may need to be updated. We'll definitely need a new test for it.
  • The reduction emitter is in maintenance mode, the plan is to replace it with the MLIR version. Therefore, if this change is beneficial, we should update the latter as well (which should be significantly easier, it shouldn't be much more than updating the thread -> input indexing).
  • Please add integration tests for the new logic (service/gpu/tests/reduction_vectorization_test.cc).

I'll take a deeper look later.

jreiffers avatar Apr 12 '24 06:04 jreiffers

I had restricted where the new vectorization logic takes effect: for safety, it only influences the legacy LLVM IR emitter. I agree with your comments that the MLIR emitter may need an update as well. Can I add the same logic to the MLIR emitter in a follow-up rather than mixing it into this PR? I can't update the MLIR emitter immediately because I'm working on another reduction optimization, so it may need to wait until that work is complete.

lingzhi98 avatar Apr 12 '24 10:04 lingzhi98

Hi @jreiffers, I have updated the reduction MLIR emitter and added a related test. I see you have been making big changes to the reduction MLIR emitter, so I often have to resolve merge conflicts. If you have time, could you take a look at this PR? It would be really helpful. Thanks!

lingzhi98 avatar Apr 24 '24 06:04 lingzhi98

Hi @jreiffers, when I tried to merge your newest update, I found that the atomic support is not correct. Your current implementation puts initialization and the atomic operation in one kernel, so there is no way to guarantee that initialization always happens before the atomics. For example, if block 0 and block 1 handle the same output elements, the memory order may be:

block 0 init -> block 0 atomic -> block 1 init -> block 1 atomic

You can run the row reduction atomic test that you added; it should show an accuracy error on an A100 40GB. Perhaps the reduction should be split into 2 kernels when atomics are needed.
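As a host-side analogy of the problem (not the actual XLA kernels; the data and names are made up), two "blocks" that each initialize the shared output and then atomically accumulate into it can lose a contribution, whereas doing all initialization in a separate phase is safe:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<float> output;

// Atomic add via a CAS loop (std::atomic<float>::fetch_add needs C++20).
void AtomicAdd(float partial) {
  float old = output.load();
  while (!output.compare_exchange_weak(old, old + partial)) {
  }
}

// Broken: init and atomic accumulation fused in one "kernel". Block 1's init
// may run after block 0's atomic, wiping out block 0's contribution.
void BlockInitAndAccumulate(float partial) {
  output.store(0.0f);  // init
  AtomicAdd(partial);  // atomic accumulation
}

int main() {
  // One launch with fused init + atomic: the result may be 3, 2, or 1,
  // depending on how the two "blocks" interleave.
  std::thread b0(BlockInitAndAccumulate, 1.0f);
  std::thread b1(BlockInitAndAccumulate, 2.0f);
  b0.join();
  b1.join();
  std::printf("fused init + atomic: %f (racy)\n", output.load());

  // Two launches: a separate init "kernel" first, then only atomics. Always 3.
  output.store(0.0f);
  std::thread c0(AtomicAdd, 1.0f);
  std::thread c1(AtomicAdd, 2.0f);
  c0.join();
  c1.join();
  std::printf("separate init kernel: %f\n", output.load());
  return 0;
}
```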

lingzhi98 avatar Apr 25 '24 07:04 lingzhi98

Thanks! I just noticed the atomic issue as well. I'll fix this today. Looking at this PR now.

jreiffers avatar Apr 25 '24 13:04 jreiffers

Thank you for the updates. I need to think about the design of this a bit. I don't think we can implement vectorization in MLIR like this. If we can't rely on the auto-vectorizer and need to do things explicitly like here, we should be using vector types instead of unrolling all the signatures. We probably need to build some infrastructure for that first. Sorry for the delays, but it's important that we get this right.

jreiffers avatar Apr 25 '24 13:04 jreiffers

Thanks. I agree with your advice; perhaps we should use vector types for MLIR, which would avoid a lot of duplicated code. Can I submit another PR that only includes the LLVM emitter change while we decide what to do for MLIR vectorization? It looks like it will be difficult to solve everything in this PR.

lingzhi98 avatar Apr 25 '24 14:04 lingzhi98

We're about to replace the LLVM emitters, so I don't think we should invest significant effort into making changes to them at this point.

For MLIR, how about this:

  • We change EmitThreadLoopNest as follows:
    • rename outputs to iter_args
    • make iter_args a vector of ValueRanges
    • undo the nested_level change. Instead, the caller can replace the symbol for the innermost loop with 0.
  • We undo the other changes in mlir_fusion_emitter. In particular, the signature of the entry function shouldn't change.

I need to think a bit more about the reduction emitter, but I think this approach should work without leading to too many new special cases in our APIs.

Also, let's split the cost model change from this PR, since it doesn't seem particularly related. Was it motivated by actual measurements, or does it just feel right?

jreiffers avatar Apr 25 '24 18:04 jreiffers

I have removed all of the changes in mlir_fusion_emitter. I don't fully understand your suggestion for EmitThreadLoopNest, but I think it would also require changing the usual usage of EmitThreadLoopNest. Why not use alloca like the LLVM emitter does? That only needs a small change to the reduction MLIR emitter and nothing else. You can see my latest commits for the details. Please ignore the formatting, it's just a draft.

lingzhi98 avatar Apr 26 '24 06:04 lingzhi98

Regarding the cost model: the change is just to make the GPU performance model tests pass. As you can see, if a reduction kernel would use only a few SM cores, I choose to split the reduced dimension to improve the active core ratio. This idea is common and widely used. But it makes https://github.com/openxla/xla/blob/main/xla/service/gpu/model/gpu_performance_model_test.cc#L567 fail, which is caused by an incorrect compute time calculation. So I can't split the cost model change into another PR, otherwise this PR can't pass presubmit CI.

lingzhi98 avatar Apr 26 '24 07:04 lingzhi98

We can rotate the iter args to support column reduction vectorization in the nested loop easily (no alloca, no explicit unrolling). The code is very concise, maybe you can take a look. Also, I found a bug in ComputeExpressionRange that prevents the affine map constraint check from being erased successfully. As a result, the LLVM loop vectorization pass fails and there is no ld.global.nc.v2 in the PTX file. The details can be seen in this PR.

lingzhi98 avatar May 01 '24 13:05 lingzhi98

Can you explain in a bit more detail what you mean by "rotate iter args"?

Also, I recommend not spending any more time on the non-MLIR emitters. They're not going to be around much longer.

jreiffers avatar May 06 '24 07:05 jreiffers

You can see this example:

  val0 = 0
  val1 = 0
  iter_args = [val0, val1]
  for i = 0; i < 2; ++i:
    cur_val = iter_args[0]
    cur_result = reduce_computation(cur_val)
    iter_args = [iter_args[1], cur_result]
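For reference, a minimal standalone C++ analogue of this rotation (the data layout and values are made up for illustration; the real emitter rotates MLIR iter args rather than a std::array):

```cpp
#include <array>
#include <cstdio>

int main() {
  // Two columns' partial inputs, interleaved as [col0, col1, col0, col1, ...].
  const float input[] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
  std::array<float, 2> iter_args = {0.f, 0.f};  // partial sums for col0, col1

  for (int i = 0; i < 6; ++i) {
    float cur_result = iter_args[0] + input[i];  // reduce into the front slot
    iter_args = {iter_args[1], cur_result};      // rotate the carried values
  }
  // After a multiple of 2 iterations the accumulators are back in order.
  std::printf("col0 = %f, col1 = %f\n", iter_args[0], iter_args[1]);  // 9, 12
  return 0;
}
```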

lingzhi98 avatar May 06 '24 07:05 lingzhi98

Ah! Great idea. Can you revert all the legacy emitter stuff from this PR and move the performance model stuff to a separate PR to be landed first? I think that'll make it easier to land this change.

jreiffers avatar May 06 '24 08:05 jreiffers

OK, will do. By the way, I find that transpose does not support vectorization, so its performance is not good enough. Does OpenXLA have plans to support it in the transpose MLIR emitter?

lingzhi98 avatar May 06 '24 08:05 lingzhi98

Yeah, we've run into that issue a few times. It's particularly bad with types <32 bits. We have no immediate plans to work on this, since eventually, this should all go away in favor of block-level codegen. However, that will take a while, so I think it would still be worthwhile to do. Happy to review your PR if you want to contribute it (but if the PR gets too big, please reach out first so we can discuss how to split it).

jreiffers avatar May 06 '24 08:05 jreiffers

Sounds great! Let's go back to the reduction emitter. There is an issue if I revert the legacy emitter changes: ReductionInfo will enable column reduction vectorization but the LLVM emitter doesn't support it. Maybe we need an extra argument (such as is_mlir) for the ReductionInfo::Create() function to prevent column reduction vectorization when using the legacy emitter. Do you recommend adding such an argument and removing it in the future, or keeping the change to the legacy emitter?

lingzhi98 avatar May 06 '24 08:05 lingzhi98

I'm about to add exactly such an argument to ReductionInfo::Create (bool for_mlir). Feel free to just ignore the legacy emitter for now, I think I'll probably land this change today or tomorrow.

jreiffers avatar May 06 '24 09:05 jreiffers

I'm about to add exactly such an argument to ReductionInfo::Create (bool for_mlir). Feel free to just ignore the legacy emitter for now, I think I'll probably land this change today or tomorrow.

See https://github.com/openxla/xla/pull/12178

jreiffers avatar May 06 '24 11:05 jreiffers

For the cost model PR, see #12208.

lingzhi98 avatar May 07 '24 06:05 lingzhi98

Hi @jreiffers, from the code comments I see that the MLIR emitter still uses LLVM IR for codegen. Could you please point out where MLIR is lowered to LLVM IR?

Zantares avatar May 11 '24 03:05 Zantares

Hi @jreiffers, from the code comments I see that the MLIR emitter still uses LLVM IR for codegen. Could you please point out where MLIR is lowered to LLVM IR?

Sorry, I missed this.

The lowering is in mlir_fusion_emitter.cc, mostly in CreateLLVMModule. Let me know if you have any specific questions, happy to help.

jreiffers avatar May 15 '24 15:05 jreiffers

Hi @jreiffers, from the code comments I see that the MLIR emitter still uses LLVM IR for codegen. Could you please point out where MLIR is lowered to LLVM IR?

Sorry, I missed this.

The lowering is in mlir_fusion_emitter.cc, mostly in CreateLLVMModule. Let me know if you have any specific questions, happy to help.

Got it, thanks! We are following this approach to support the SPIRV target in the MLIR path. I also discussed this with @penpornk, not sure if she has reached out to you yet.

Another question: do you have any public plan/RFC showing the MLIR-related changes in OpenXLA? We were told that the HLO-to-LLVM IR emitters would not be removed so soon, and the MLIR work started with a small scope. When should we prepare the MLIR path for ops other than fusions?

Zantares avatar May 16 '24 01:05 Zantares

Got it, thanks! We are following this approach to support the SPIRV target in the MLIR path. I also discussed this with @penpornk, not sure if she has reached out to you yet.

Not yet, I believe she's OOO at the moment.

Another question: do you have any public plan/RFC showing the MLIR-related changes in OpenXLA?

There's not currently any public RFC, sorry.

We were told that the HLO-to-LLVM IR emitters would not be removed so soon, and the MLIR work started with a small scope. When should we prepare the MLIR path for ops other than fusions?

Originally, this project started with a small scope, but we have since decided to replace the legacy emitters entirely. I would recommend not investing any more effort into them. The MLIR emitters are now feature complete; I'm basically just tracking down some last loose ends.

Ops other than fusions: are you referring to ir_emitter_unnested? That doesn't do much codegen, so everything new can (and probably should) use the MLIR path. Maybe for trivial things there might still be a reason to go straight from HLO->LLVM, but mid-term, everything interesting will go through MLIR.

jreiffers avatar May 16 '24 06:05 jreiffers

@jreiffers @lingzhi98

Is it possible to get micro-benchmark results for this and https://github.com/openxla/xla/commit/72788f177dde61e1efbfd744435aa1985a4ff6c0 ?

Overall, vectorization is super-beneficial for small datatypes, so I'd be shocked if it makes no difference for column reductions over s8.

cheshire avatar May 16 '24 07:05 cheshire

Is it possible to get micro-benchmark results for this

Feel free to give it a shot. There are some indexing issues still not resolved (I thought the previous PR replaced all the affine apply ops, but it doesn't look like that's the case), but for microbenchmarking it shouldn't matter.

and https://github.com/openxla/xla/commit/72788f177dde61e1efbfd744435aa1985a4ff6c0 ?

Probably not, that change is rather old by now. But you'll probably get what you expect, it's just that these reductions don't seem to matter very much in most real models (until now).

jreiffers avatar May 16 '24 08:05 jreiffers

Just took another look at this. Similar to transpose vectorization, it's great that this works, but the code is a bit too complex for me (due to limitations of the infrastructure). As discussed earlier, I'll look into whether we can easily support vector types to hopefully make this a bit more understandable.

jreiffers avatar May 21 '24 14:05 jreiffers

@lingzhi98 Please take a look at https://github.com/openxla/xla/pull/12941, if you have time. Thank you!

jreiffers avatar May 22 '24 13:05 jreiffers