xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
[XLA:CPU] Verify invariant buffers of `KernelThunk` in the runtime.
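One plausible way to verify this at runtime, as a minimal sketch under assumptions (the checksum approach, `Fnv1aHash`, and `InvariantBufferChecker` are illustrative, not the actual XLA implementation): snapshot a cheap hash of every buffer marked invariant before the kernel runs, then confirm it is unchanged afterwards.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include "absl/types/span.h"

// Simple FNV-1a hash over a byte span; any cheap checksum would do here.
uint64_t Fnv1aHash(absl::Span<const uint8_t> bytes) {
  uint64_t h = 1469598103934665603ull;                     // FNV offset basis
  for (uint8_t b : bytes) h = (h ^ b) * 1099511628211ull;  // FNV prime
  return h;
}

class InvariantBufferChecker {
 public:
  // Record a checksum of each invariant buffer before the kernel runs.
  void Snapshot(absl::Span<const absl::Span<const uint8_t>> buffers) {
    hashes_.clear();
    for (const auto& buf : buffers) hashes_.push_back(Fnv1aHash(buf));
  }

  // After the kernel, verify that no invariant buffer was written to.
  bool Unchanged(absl::Span<const absl::Span<const uint8_t>> buffers) const {
    if (buffers.size() != hashes_.size()) return false;
    for (size_t i = 0; i < buffers.size(); ++i) {
      if (Fnv1aHash(buffers[i]) != hashes_[i]) return false;
    }
    return true;
  }

 private:
  std::vector<uint64_t> hashes_;
};
```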
Move FindInstruction and FindComputation core functionality from hlo_test_base to hlo_query
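A hedged usage sketch of the moved helpers; the signatures and header path are assumed to mirror the old `HloTestBase` ones:

```cpp
#include "xla/hlo/utils/hlo_query.h"

// Look up nodes by name directly on an HloModule, without an HloTestBase.
void InspectModule(xla::HloModule* module) {
  xla::HloInstruction* dot = xla::hlo_query::FindInstruction(module, "dot.1");
  xla::HloComputation* add = xla::hlo_query::FindComputation(module, "add");
  if (dot != nullptr && add != nullptr) {
    // ... inspect the located instruction/computation ...
  }
}
```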
CuDnnThunk, currently used for GEMM fusions, is capable of executing arbitrary cuDNN graphs. Moving FMHA to use it lets us remove a lot of specialized runtime code. An overview of the change...
[PJRT IFRT] Pass the distributed client into the PJRT IFRT layer for TPU (already done for CPU; GPU will be a separate CL). Objective: let IFRT handle topology exchange and other...
Automated Code Change
[XLA:GPU] Stable ordering of keys in gemm+DS rewriter
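The general determinism technique, shown as an illustrative sketch rather than the rewriter's actual code: hash-map iteration order varies between runs, so collect and sort the keys before doing anything order-sensitive.

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "absl/container/flat_hash_map.h"

// Extract the keys and sort them so downstream rewrites see a stable order.
std::vector<std::string> SortedKeys(
    const absl::flat_hash_map<std::string, int>& map) {
  std::vector<std::string> keys;
  keys.reserve(map.size());
  for (const auto& [key, value] : map) keys.push_back(key);
  std::sort(keys.begin(), keys.end());
  return keys;
}
```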
[XLA:GPU] Add a method to get all constraints for variables in an indexing map. This will allow us to only iterate over constraints in an indexing map once.
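A hypothetical sketch of the single-pass idea (the real `IndexingMap` API may differ): bucket each constraint under every variable it references, so later per-variable queries need no rescans.

```cpp
#include <vector>

// Stand-in for a constraint in an indexing map; the real type holds an
// affine expression and its feasible interval.
struct Constraint {
  std::vector<int> used_variable_ids;
};

// One pass over all constraints, grouping them by the variables they use.
std::vector<std::vector<const Constraint*>> GroupConstraintsByVariable(
    const std::vector<Constraint>& constraints, int num_variables) {
  std::vector<std::vector<const Constraint*>> per_variable(num_variables);
  for (const Constraint& constraint : constraints) {
    for (int var : constraint.used_variable_ids) {
      per_variable[var].push_back(&constraint);
    }
  }
  return per_variable;
}
```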
[XLA:GPU] Introduce the `TiledHloFusionInstruction` class. It is to `TiledHloInstruction` what `HloFusionInstruction` is to `HloInstruction`. Its main purpose will be to wrap nested fusions for block-level code generation.
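A rough shape of the analogy, with member names assumed rather than taken from the actual class:

```cpp
#include <memory>

class TiledHloComputation { /* tiled instructions, roots, ... */ };

class TiledHloInstruction { /* tile offsets, sizes, strides, ... */ };

// Wraps a nested tiled computation for block-level code generation, the way
// HloFusionInstruction wraps an HloComputation.
class TiledHloFusionInstruction : public TiledHloInstruction {
 public:
  const TiledHloComputation* called_computation() const {
    return called_computation_.get();
  }

 private:
  std::unique_ptr<TiledHloComputation> called_computation_;
};
```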
[XLA:GPU] Implement fusing int4 parameters into Triton dots. Right now this works for the simple case where S4 is the LHS argument and the contracting dimension is either minor (dim 1) or not minor (dim 0).
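A hedged sketch of the supported-case predicate (the helper name and the rank-2 assumption are mine, not the rewriter's):

```cpp
#include <cstdint>

#include "xla/hlo/ir/hlo_instruction.h"

// Fuse only when the S4 operand is the dot's LHS and the LHS contracting
// dimension is either minor (index 1 for a rank-2 operand) or major (index 0).
bool IsSimpleInt4DotCase(const xla::HloInstruction* dot) {
  const xla::HloInstruction* lhs = dot->operand(0);
  if (lhs->shape().element_type() != xla::S4) return false;
  const auto& dnums = dot->dot_dimension_numbers();
  if (dnums.lhs_contracting_dimensions_size() != 1) return false;
  const int64_t contracting_dim = dnums.lhs_contracting_dimensions(0);
  return contracting_dim == 0 || contracting_dim == 1;
}
```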
[XLA:GPU] Simplify the C64 cuBLASLt matrix dimension check, i.e. the check that for C64 the non-contracting dimension fed to cuBLASLt is short enough. 1. Removed the MatrixIsColumnMajor() function (which...