[ROCM] Splitting gridDim.x for very large reductions / loop fusion kernels

Open pemeliya opened this issue 3 months ago • 5 comments

We had several kernel launch failures on the ROCM side caused by very large gridDim.x values for fusion kernels. This PR proposes splitting gridDim.x into gridDim.x/y in these cases:

  1. added gridDim.x splitting for reduction and loop fusions to handle large grid sizes
  2. extended gpu_kernel_tiling test and buffer_comparator_test
  3. improved error logging in rocm_stream.cc in case of kernel launch failure
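
For illustration only, the splitting idea in item 1 can be sketched as below. This is a minimal host-side sketch and not the code from this PR: `SplitGridX`, `LinearBlockId`, `GridDims`, and the `max_x` / `max_y` limits are hypothetical names, and the actual limits depend on the device and runtime. It assumes the kernel rebuilds a linear block index from `blockIdx.y * gridDim.x + blockIdx.x` and skips any index at or beyond the original block count.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical 2D launch shape; z is left at 1 in this sketch.
struct GridDims {
  uint64_t x;
  uint64_t y;
};

// Splits a 1D block count into (x, y) so that x <= max_x, y <= max_y and
// x * y >= num_blocks. The limits are caller-supplied assumptions.
GridDims SplitGridX(uint64_t num_blocks, uint64_t max_x, uint64_t max_y) {
  if (num_blocks == 0 || max_x == 0 || max_y == 0) {
    throw std::invalid_argument("block count and limits must be positive");
  }
  if (num_blocks <= max_x) {
    return {num_blocks, 1};  // Fits in x alone, no split needed.
  }
  // Smallest y such that ceil(num_blocks / y) still fits within max_x.
  uint64_t y = num_blocks / max_x + (num_blocks % max_x != 0 ? 1 : 0);
  if (y > max_y) {
    throw std::overflow_error("grid does not fit into x and y");
  }
  // Rebalance x so the padded grid overshoots num_blocks as little as possible.
  uint64_t x = num_blocks / y + (num_blocks % y != 0 ? 1 : 0);
  assert(x <= max_x && x * y >= num_blocks);
  return {x, y};
}

// Linear block index a kernel would recompute from its 2D block coordinates.
// Blocks whose index is >= num_blocks are padding and should exit early.
uint64_t LinearBlockId(uint64_t block_x, uint64_t block_y, uint64_t grid_x) {
  return block_y * grid_x + block_x;
}
```

With `num_blocks = 10` and `max_x = 4`, the sketch yields a 4 x 3 grid: 12 blocks are launched, and the two padding blocks with linear index 10 and 11 do no work.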

We already had some ROCM-related fixes for this issue; this PR unifies all of them.

@xla-rotation: could you have a look please ?

pemeliya avatar Aug 12 '25 13:08 pemeliya

@beckerhe could you perhaps have a look at this PR? It is quite important for ROCM since it fixes kernel launch failures when grid dimensions are too big. Also it removes a few hacks on ROCM side and unifies CUDA/ROCM treatment

pemeliya avatar Sep 30 '25 12:09 pemeliya

IMO this PR is too large. We have a numeric test failing, which will be difficult to debug when multiple changes are merged in a single PR. Could you split this PR into smaller ones?

Secondly there is a build error and a missing variable error that would need to be addressed before merging:

 error: module xla/backends/gpu/codegen/emitters:reduction does not directly depend on a module exporting 'absl/numeric/bits.h', which is part of indirectly-used module third_party/absl/numeric/bits.h
error: use of undeclared identifier 'hero_operand_index' 

derdrdirk avatar Oct 10 '25 12:10 derdrdirk

Done! The above bazel errors should also be fixed now.

pemeliya avatar Oct 20 '25 08:10 pemeliya

Hi @xla-rotation, may I ask what the status of this PR is? It still seems to be pending.

i-chaochen avatar Nov 04 '25 12:11 i-chaochen

Yeah, as far as I can see this PR has not been merged upstream yet. I wonder if there are some failing internal tests?

pemeliya avatar Nov 06 '25 10:11 pemeliya