[ROCM] Splitting gridDim.x for very large reductions / loop fusion kernels
We had several failures on the ROCm side due to very large gridDim.x values for fusion kernels. This PR proposes splitting gridDim.x across gridDim.x and gridDim.y in those cases (a sketch of the idea follows the summary below).
- added gridDim.x splitting for reduction and loop fusions to handle large grid sizes
- extended gpu_kernel_tiling test and buffer_comparator_test
- improved error logging in rocm_stream.cc in case of kernel launch failure
We already had several ROCm-related fixes for this issue; this PR unifies them.
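For readers unfamiliar with the trick, here is a minimal host-side sketch of the splitting scheme. This is illustrative only and not the PR's actual code; `kMaxGridDimX` and `SplitGridDimX` are assumed names, and the real limit would come from the device properties at launch time.

```cpp
#include <cstdint>
#include <utility>

// Assumed per-device limit on gridDim.x (illustrative; queried from the
// device at runtime in practice).
constexpr uint64_t kMaxGridDimX = (1ull << 31) - 1;

// Factors a 1-D block count into {grid_x, grid_y} such that
// grid_x <= kMaxGridDimX and grid_x * grid_y >= total_blocks.
std::pair<uint64_t, uint64_t> SplitGridDimX(uint64_t total_blocks) {
  if (total_blocks <= kMaxGridDimX) {
    return {total_blocks, 1};
  }
  // Smallest grid_y that brings grid_x under the limit; grid_x is then the
  // ceiling division so every original block is still covered.
  uint64_t grid_y = (total_blocks + kMaxGridDimX - 1) / kMaxGridDimX;
  uint64_t grid_x = (total_blocks + grid_y - 1) / grid_y;
  return {grid_x, grid_y};
}

// Inside the generated kernel the linear block id would be recovered as
//   linear_block = blockIdx.y * gridDim.x + blockIdx.x;
// with an early return when linear_block >= total_blocks, since the split
// grid may contain a few padding blocks.
```

At launch time this would be used roughly as `auto [gx, gy] = SplitGridDimX(num_blocks);` followed by a launch with `dim3(gx, gy, 1)`.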
@xla-rotation: could you have a look, please?
@beckerhe could you perhaps have a look at this PR? It is quite important for ROCm since it fixes kernel launch failures when the grid dimensions are too large. It also removes a few hacks on the ROCm side and unifies the CUDA/ROCm treatment.
IMO this PR is too large. We have a numeric test failing, which will be difficult to debug when multiple changes are merged into a single PR. Could you split this PR into smaller ones?
Secondly, there is a build error and a missing-variable error that need to be addressed before merging:
error: module xla/backends/gpu/codegen/emitters:reduction does not directly depend on a module exporting 'absl/numeric/bits.h', which is part of indirectly-used module third_party/absl/numeric/bits.h
error: use of undeclared identifier 'hero_operand_index'
Done! The above Bazel errors should also be fixed now.
Hi @xla-rotation, may I ask what the status of this PR is? It seems to still be pending.
Yeah, as far as I can see, this PR has not been merged upstream yet. I wonder if there are some failing internal tests?