
[XLA:GPU] Allow cuDNN scaled dot fusions in the gemm autotuner

Open sergey-kozub opened this issue 6 months ago • 5 comments

📝 Summary of Changes: Allow selecting cuDNN gemm configs when autotuning scaled dot fusions.

🎯 Justification: cuDNN has a kernel for block scaled dot operations; this PR enables it in the autotuner. Note: the XLA flag `--xla_gpu_experimental_scaled_dot_with_triton` is required to enable this.

🚀 Kind of Contribution ✨ New Feature ⚡️ Performance Improvement
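For reference, XLA debug flags like the one named above are typically passed through the `XLA_FLAGS` environment variable before the program that builds the XLA computation starts. A minimal sketch (the downstream program is whatever you run, not part of this PR):

```shell
# Sketch: opt in to the experimental scaled-dot path via XLA_FLAGS.
# The flag name comes from the PR description; XLA reads XLA_FLAGS at startup.
export XLA_FLAGS="--xla_gpu_experimental_scaled_dot_with_triton"
echo "$XLA_FLAGS"
```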

sergey-kozub avatar Oct 15 '25 12:10 sergey-kozub

The following tests fail:

  • CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting on B200.
  • GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3 on H100.

Can you please fix?

golechwierowicz avatar Oct 30 '25 09:10 golechwierowicz

> GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3

Fixed the issue breaking this test.

> CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting

This test fails for me at HEAD (i.e., it seems unrelated to this PR). Could you please confirm whether it passes for you at HEAD?

sergey-kozub avatar Nov 03 '25 12:11 sergey-kozub

> This test fails for me at HEAD (i.e., it seems unrelated to this PR). Could you please confirm whether it passes for you at HEAD?

It passes for us at HEAD.

golechwierowicz avatar Nov 04 '25 11:11 golechwierowicz

> It passes for us at HEAD.

Looking into this.

sergey-kozub avatar Nov 04 '25 11:11 sergey-kozub

Also, please make sure you include the right build deps. build_cleaner reports:

```
to fix run:
	build_cleaner... /xla/service/gpu/transforms:block_scaling_rewriter_test
   24 | #include "third_party/tensorflow/compiler/xla/hlo/ir/hlo_casting_utils.h"
      |          ^
to fix run:
	build_cleaner... /xla/service/gpu/transforms:block_scaling_rewriter_test
   26 | #include "third_party/tensorflow/compiler/xla/hlo/testlib/filecheck.h"
```
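The errors above indicate the test target's `deps` don't cover the two included headers. A sketch of the missing dep entries, assuming target labels inferred from the include paths (not confirmed in this thread; `build_cleaner` would emit the exact labels):

```
# Hypothetical additions to the block_scaling_rewriter_test target's deps;
# labels are inferred from the include paths and may differ in-tree.
deps = [
    "//xla/hlo/ir:hlo_casting_utils",   # for hlo_casting_utils.h
    "//xla/hlo/testlib:filecheck",      # for filecheck.h
    # ... existing deps ...
],
```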

golechwierowicz avatar Nov 04 '25 13:11 golechwierowicz