[XLA:GPU] Allow cuDNN scaled dot fusions in the gemm autotuner
📝 Summary of Changes
Allow selecting cuDNN gemm configs when autotuning scaled dot fusions.
🎯 Justification
cuDNN has a kernel for block scaled dot operations; this PR enables it in the autotuner.
Note: the XLA flag --xla_gpu_experimental_scaled_dot_with_triton is required to enable this.
🚀 Kind of Contribution
✨ New Feature
⚡️ Performance Improvement
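For reference, the flag can be passed through the `XLA_FLAGS` environment variable (the standard mechanism for XLA debug options); a minimal sketch:

```shell
# Enable the experimental scaled-dot path so the autotuner can pick
# cuDNN gemm configs for scaled dot fusions (flag name from this PR).
export XLA_FLAGS="--xla_gpu_experimental_scaled_dot_with_triton"
echo "$XLA_FLAGS"
```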
The following tests fail:
- CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting on B200.
- GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3 on H100.
Can you please fix?
GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3
Fixed the issue breaking this test.
CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting
This test fails for me at HEAD (i.e., it seems unrelated to this PR). Could you please confirm whether it passes for you at HEAD?
It passes for us at HEAD.
Looking into this.
Also, please make sure you include the right build deps.
To fix, run:
build_cleaner... /xla/service/gpu/transforms:block_scaling_rewriter_test
which reports missing deps for these includes:
24 | #include "third_party/tensorflow/compiler/xla/hlo/ir/hlo_casting_utils.h"
   | ^
26 | #include "third_party/tensorflow/compiler/xla/hlo/testlib/filecheck.h"
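A sketch of the dep additions build_cleaner would likely make, assuming the usual XLA convention that these target labels match the include paths (the exact labels and rule name are assumptions, not taken from the PR):

```starlark
# Hypothetical BUILD fragment for block_scaling_rewriter_test.
xla_cc_test(
    name = "block_scaling_rewriter_test",
    # ... existing attributes ...
    deps = [
        # ... existing deps ...
        "//xla/hlo/ir:hlo_casting_utils",  # for hlo_casting_utils.h
        "//xla/hlo/testlib:filecheck",     # for filecheck.h
    ],
)
```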