[XLA:GPU] Allow cuDNN scaled dot fusions in the gemm autotuner
📝 Summary of Changes
Allow selecting cuDNN gemm configs when autotuning scaled dot fusions.
🎯 Justification
cuDNN has a kernel for block scaled dot operations; this PR enables it in the autotuner.
Note: the XLA flag --xla_gpu_experimental_scaled_dot_with_triton is required to enable this.
🚀 Kind of Contribution
✨ New Feature
⚡️ Performance Improvement
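For reference, the flag can be passed through the `XLA_FLAGS` environment variable (the standard mechanism for XLA debug options); a minimal sketch:

```shell
# Enable the experimental scaled-dot path so the autotuner can pick
# cuDNN gemm configs for scaled dot fusions (flag name from this PR).
export XLA_FLAGS="--xla_gpu_experimental_scaled_dot_with_triton"
echo "$XLA_FLAGS"
```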
The following tests fail:
- CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting on B200.
- GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3 on H100.
Can you please fix?
GemmFusionAutotunerLevelSweep/GemmFusionAutotunerLevelTest.Deviceless/3
Fixed the issue breaking this test.
CollectiveOpsTestE2EShardedUnsharded.BlockScaledDotNonContractingAndContracting
This test fails for me at HEAD (i.e., it seems unrelated to this PR). Could you please confirm whether it passes for you at HEAD?
It passes for us at HEAD.
Looking into this.
Also, please make sure you include the right build deps.
To fix, run:
build_cleaner... /xla/service/gpu/transforms:block_scaling_rewriter_test
which reports missing deps for these includes:
24 | #include "third_party/tensorflow/compiler/xla/hlo/ir/hlo_casting_utils.h"
   | ^
26 | #include "third_party/tensorflow/compiler/xla/hlo/testlib/filecheck.h"
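A sketch of the dep additions build_cleaner would likely make, assuming the usual XLA convention that these target labels match the include paths (the exact labels and rule name are assumptions, not taken from the PR):

```starlark
# Hypothetical BUILD fragment for block_scaling_rewriter_test.
xla_cc_test(
    name = "block_scaling_rewriter_test",
    # ... existing attributes ...
    deps = [
        # ... existing deps ...
        "//xla/hlo/ir:hlo_casting_utils",  # for hlo_casting_utils.h
        "//xla/hlo/testlib:filecheck",     # for filecheck.h
    ],
)
```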