xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
[Autotuner] Make buffer checking best-effort rather than forcing it. - There are cases in gemm_fusion_autotuner where we don't have a reference output from cuBLAS and we skip the requested...
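A minimal sketch of the best-effort pattern this entry describes, under the assumption that "skip" means treating a missing reference as a pass; the names (`CheckCandidateOutput`, `CompareBuffers`) are illustrative placeholders, not the actual autotuner API:

```cpp
#include <iostream>
#include <optional>
#include <vector>

// Illustrative stand-in for the autotuner's output buffers; not the real XLA types.
using Buffer = std::vector<float>;

bool CompareBuffers(const Buffer& a, const Buffer& b) {
  return a == b;  // Placeholder for a tolerance-based comparison.
}

// Best-effort check: if no cuBLAS reference output is available, skip the
// comparison instead of failing the autotuning run.
bool CheckCandidateOutput(const std::optional<Buffer>& reference_output,
                          const Buffer& candidate_output) {
  if (!reference_output.has_value()) {
    std::cerr << "No reference output available; skipping buffer check.\n";
    return true;  // Accept the candidate rather than erroring out.
  }
  return CompareBuffers(*reference_output, candidate_output);
}
```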
[IFRT Proxy] Make `ifrt_proxy::client::LoadedExecutable` implement `MpmdLoadedExecutableInterface`. This change updates the `LoadedExecutable` class in the IFRT proxy client to inherit from `xla::ifrt::MpmdLoadedExecutableInterface` and adds declarations for the MPMD-specific methods.
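A self-contained C++ sketch of the inheritance pattern this change introduces; the interface body and the method name below are simplified placeholders, not the actual `xla::ifrt` declarations:

```cpp
#include <string>
#include <vector>

// Placeholder for xla::ifrt::MpmdLoadedExecutableInterface: an abstract
// interface exposing MPMD-specific queries on a loaded executable.
class MpmdLoadedExecutableInterface {
 public:
  virtual ~MpmdLoadedExecutableInterface() = default;
  // Hypothetical MPMD-specific method, for illustration only.
  virtual std::vector<std::string> MpmdProgramNames() const = 0;
};

// Placeholder for ifrt_proxy::client::LoadedExecutable, now deriving from the
// MPMD interface and declaring overrides for its methods.
class LoadedExecutable : public MpmdLoadedExecutableInterface {
 public:
  std::vector<std::string> MpmdProgramNames() const override {
    // In the proxy client this would be answered via an RPC to the server.
    return {"main"};
  }
};
```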
[XLA:GPU] Enable dynamic slice support; replace usages of the legacy IsTritonSupportedDynamicSlice.
Reverts 9cf521108c2b54328e973c05bd19941c476a5c3c
Move CustomKernelThunk into its own file. CustomKernelThunk is currently declared in kernel_thunk.h; this change moves it into its own file, custom_kernel_thunk.h. The same is done for the implementation (kernel_thunk.cc...
KernelSpecTest improvements and cleanups - Improves how we invent pointers to CUDA kernels - Adds parameter comments for ambiguous parameters - Makes use of `ParseTextProtoOrDie`
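A sketch of the kind of test helper `ParseTextProtoOrDie` refers to; the wrapper below is written from scratch around protobuf's `TextFormat` and is an assumption about its behavior, not the actual XLA/TSL utility:

```cpp
#include <string>

#include "absl/log/check.h"
#include "google/protobuf/duration.pb.h"
#include "google/protobuf/text_format.h"

// Parses a text-format proto literal, aborting on malformed input so test
// fixtures stay short and failures are loud.
template <typename T>
T ParseTextProtoOrDie(const std::string& text) {
  T message;
  CHECK(google::protobuf::TextFormat::ParseFromString(text, &message))
      << "Failed to parse text proto:\n" << text;
  return message;
}

// Example use in a test body (Duration is just a convenient well-known type).
void ExampleUsage() {
  auto d = ParseTextProtoOrDie<google::protobuf::Duration>(
      "seconds: 5 nanos: 250000000");
  CHECK_EQ(d.seconds(), 5);
}
```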
📝 Summary of Changes - Adding a heuristic to the GPU scheduler for better MoveToHost overlapping. 🎯 Justification This could help hide D2H/H2D data movement behind computations. 🚀 Kind of Contribution...
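The change itself lives in XLA's scheduler; the sketch below is not XLA code, just a plain CUDA-runtime illustration of the overlap it aims for, where a device-to-host copy issued on its own stream proceeds concurrently with compute on another stream (the kernel launch is left as a comment):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  constexpr size_t kBytes = 64 << 20;
  void *device_buf = nullptr, *host_pinned = nullptr;
  cudaMalloc(&device_buf, kBytes);
  cudaMallocHost(&host_pinned, kBytes);  // Pinned memory enables async D2H.

  cudaStream_t copy_stream, compute_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);

  // Start the D2H transfer early on the copy stream...
  cudaMemcpyAsync(host_pinned, device_buf, kBytes, cudaMemcpyDeviceToHost,
                  copy_stream);
  // ...while independent work runs on the compute stream; the two overlap.
  // my_kernel<<<grid, block, 0, compute_stream>>>(...);

  cudaStreamSynchronize(copy_stream);
  cudaStreamSynchronize(compute_stream);
  std::printf("D2H copy overlapped with compute.\n");

  cudaFreeHost(host_pinned);
  cudaFree(device_buf);
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
  return 0;
}
```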
Enable f32 dots by default in YNNPACK. We expect this to be a small speedup for f32 dots in wall-clock time, but a significant improvement in CPU time (~30%)....
TensorFlow version: 2.19; Python version: 3.10; Bazel version: 6.5.0; GCC compiler version: 15.2.0; CUDA and cuDNN versions: 12.6.1 / 9.4.0; ROCm version: 6.2.0; LLVM: 18.1.8 (system side); LLVM ROCm: 18.0.0git; GPU model...
📝 Summary of Changes - Adding a knob to control the limit on the async-compute resource. This switch provides ample flexibility for control, enabling more asynchronous computations to execute concurrently. In...
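The flag name and its plumbing are not shown in the truncated entry; the toy below only illustrates what such a knob controls, under the assumption that it caps how many asynchronous computations may be in flight at once, so a larger limit yields more overlap:

```cpp
#include <algorithm>
#include <cstdio>

// Toy model: each async op occupies one unit of the async-compute resource for
// one step. At most `async_compute_limit` ops run concurrently, so raising the
// limit drains the same work in fewer steps. Illustration only, not XLA's scheduler.
int StepsToDrain(int num_async_ops, int async_compute_limit) {
  int steps = 0;
  int remaining = num_async_ops;
  while (remaining > 0) {
    remaining -= std::min(remaining, async_compute_limit);
    ++steps;
  }
  return steps;
}

int main() {
  for (int limit : {1, 2, 4}) {
    std::printf("limit=%d -> %d steps for 8 async ops\n", limit,
                StepsToDrain(/*num_async_ops=*/8, limit));
  }
  return 0;
}
```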