xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
Fixed #25211 by updating the C API header to explicitly mark num_outputs as an output parameter, clarifying the API contract. I also added a new C API test that calls...
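A rough illustration of that contract, as a hypothetical sketch rather than the actual XLA C API header (the struct and field names below are made up): an output parameter is one the callee writes, and the caller must not rely on the value it passes in.

```cpp
// Hypothetical sketch of an "output parameter" contract in a C API header.
// The struct and field names are illustrative, not the real XLA/PJRT C API.
#include <stddef.h>

typedef struct MyExecuteArgs {
  size_t struct_size;   // in: set by the caller (used for versioning)
  void** output_lists;  // out: populated by the callee
  size_t num_outputs;   // out: written by the callee; the caller must not
                        //      rely on the value it passes in
} MyExecuteArgs;
```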
Cleaned up coordination service key-value store.
Reverts 189d67b2eab0b53e65bb226cf48082db6f633abb
Move `TopKKernel` behind `GpuKernelRegistry`.
* Moves `TopK` logic into `backends/gpu/runtime` since it's a runtime component.
* Defines a trait for the `TopK` kernel in `stream_executor/gpu/`.
* Moves the implementations of this...
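A minimal sketch of the trait-plus-registry pattern this describes; `GpuKernelRegistry` and `TopK` come from the commit, but the class layout, signatures, and `Register`/`Get` methods below are assumptions, not the real stream_executor API:

```cpp
// Hypothetical sketch of the trait + registry pattern. The trait identifies
// the kernel and its launch signature; the registry maps traits to
// backend-specific implementations via type-erased storage.
#include <any>
#include <functional>
#include <map>
#include <string>

struct TopKKernelTrait {
  static constexpr const char* kName = "TopK";
  // Assumed launch signature: (input, num_elements, k, out_values, out_indices).
  using KernelType = std::function<void(const float*, int, int, float*, int*)>;
};

class GpuKernelRegistrySketch {
 public:
  template <typename Trait>
  void Register(typename Trait::KernelType kernel) {
    kernels_[Trait::kName] = std::move(kernel);  // type-erased storage
  }

  template <typename Trait>
  typename Trait::KernelType Get() const {
    return std::any_cast<typename Trait::KernelType>(kernels_.at(Trait::kName));
  }

 private:
  std::map<std::string, std::any> kernels_;
};
```

In this shape, a CUDA or ROCm backend registers its implementation against the trait, and the runtime code in `backends/gpu/runtime` looks it up through the registry instead of depending on the implementation directly.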
Add TMA descriptor extraction to launcher. Port over the getTmaDesc function and refactor it to follow the extractor API. Re-enable the pipeliner and experimental_tma tests, which are fixed by this change.
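A hedged sketch of the extractor pattern this refers to; `getTmaDesc` and the real extractor API are not shown here, and every name below (`KernelArgument`, `TmaDescriptor`, `CollectTmaDescriptors`) is hypothetical:

```cpp
// Hypothetical sketch: the launcher asks an extractor to turn kernel
// arguments into TMA descriptors before launch. Not the real XLA/Triton API.
#include <optional>
#include <vector>

struct TmaDescriptor { unsigned char bytes[128]; };  // opaque, device-encoded

struct KernelArgument {
  const void* device_ptr;
  bool needs_tma;  // whether this argument is accessed through TMA
};

// Returns a descriptor for arguments that need one; std::nullopt otherwise.
std::optional<TmaDescriptor> ExtractTmaDescriptor(const KernelArgument& arg) {
  if (!arg.needs_tma) return std::nullopt;
  TmaDescriptor desc{};
  // ... encode a tensor map for `arg.device_ptr` here (driver-specific) ...
  return desc;
}

// The launcher collects all descriptors up front and passes them to the kernel.
std::vector<TmaDescriptor> CollectTmaDescriptors(
    const std::vector<KernelArgument>& args) {
  std::vector<TmaDescriptor> descs;
  for (const auto& arg : args) {
    if (auto d = ExtractTmaDescriptor(arg)) descs.push_back(*d);
  }
  return descs;
}
```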
Automated Code Change
PR #24744: Don't clone instructions in HloEvaluator. Imported from GitHub PR https://github.com/openxla/xla/pull/24744. There doesn't appear to be any good reason to do this, but it makes HloEvaluator very dangerous to...
[XLA:GPU][TMA] Use the offsets-sizes-strides interface for triton_xla ops. This improves readability and lets us remove our own parsing, printing, and some verification code.
Description:
- Ported profiler's code from pybind11 to nanobind
- Added nanobind mutexes to protect PythonHooks and PythonHookContext under free-threading
- Race seen in JAX: https://github.com/jax-ml/jax/actions/runs/14651237353/job/41117294221?pr=28245#step:18:1561
```
#8 xla::profiler::PythonHookContext::ProfileFast(_frame*, int,...
```
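A minimal sketch of the locking pattern described above, using a plain `std::mutex` as a stand-in for nanobind's free-threading primitives; the module, functions, and guarded state are illustrative placeholders for `PythonHooks`/`PythonHookContext`, not the real `xla::profiler` code:

```cpp
// Hypothetical sketch: guard shared profiler state with a mutex so that
// concurrently running Python threads (free-threaded CPython) cannot race.
#include <mutex>
#include <nanobind/nanobind.h>

namespace nb = nanobind;

namespace {
std::mutex hooks_mu;           // protects the shared hook state below
int active_hook_sessions = 0;  // placeholder for the profiler's shared state
}  // namespace

NB_MODULE(profiler_sketch, m) {
  m.def("start", [] {
    std::lock_guard<std::mutex> lock(hooks_mu);  // serialize concurrent starts
    ++active_hook_sessions;
  });
  m.def("stop", [] {
    std::lock_guard<std::mutex> lock(hooks_mu);  // serialize concurrent stops
    if (active_hook_sessions > 0) --active_hook_sessions;
  });
}
```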
[XLA:LatencyHidingScheduler] Split ReadySetLt::operator() into multiple functions. Split some of the heuristics into different functions to make them easier to read.
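The shape of that refactoring, sketched with made-up heuristics; `ReadySetLt` is named in the commit, but the fields and helper functions below are hypothetical:

```cpp
// Hypothetical sketch: each heuristic becomes a small named function that
// returns "a first", "b first", or "no opinion", and operator() chains them
// in priority order instead of mixing everything in one large body.
#include <optional>

struct CandidatePair {
  // Placeholder features of the two ready instructions being compared.
  int a_depth = 0, b_depth = 0;
  bool a_unlocks_done = false, b_unlocks_done = false;
};

class ReadySetLtSketch {
 public:
  // Returns true if candidate `a` should be scheduled before `b`.
  bool operator()(const CandidatePair& c) const {
    if (auto r = CompareAsyncDepth(c)) return *r;
    if (auto r = CompareUnlocksDone(c)) return *r;
    return TieBreak(c);
  }

 private:
  std::optional<bool> CompareAsyncDepth(const CandidatePair& c) const {
    if (c.a_depth != c.b_depth) return c.a_depth > c.b_depth;
    return std::nullopt;  // no opinion: fall through to the next heuristic
  }
  std::optional<bool> CompareUnlocksDone(const CandidatePair& c) const {
    if (c.a_unlocks_done != c.b_unlocks_done) return c.a_unlocks_done;
    return std::nullopt;
  }
  bool TieBreak(const CandidatePair&) const { return true; }  // stable default
};
```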