xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
This PR uses cuGraphInstantiateWithParams instead of cuGraphInstantiate to instantiate CUDA graph executors; the existing command_buffer_cmd_test and command_buffer_thunk_test should cover the changes in this PR.
Adds Python bindings for `xla_gpu_kernel_cache_file`, `xla_gpu_enable_llvm_module_compilation_parallelism`, and `xla_gpu_per_fusion_autotune_cache_dir`. We would like to add some convenience features to JAX that enable all caches with one flag/option (will open PR for...
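As a hedged sketch of how these debug options are typically consumed, XLA reads its debug flags from the `XLA_FLAGS` environment variable, which must be set before the runtime (e.g. JAX) is imported. The flag names below come from the entry above; the cache paths are illustrative placeholders, not defaults.

```python
import os

# Illustrative only: enable the new cache-related options via XLA_FLAGS.
# The paths are hypothetical; set them before importing jax so the
# XLA runtime picks them up at initialization.
os.environ["XLA_FLAGS"] = " ".join([
    "--xla_gpu_kernel_cache_file=/tmp/xla_kernel_cache",
    "--xla_gpu_enable_llvm_module_compilation_parallelism=true",
    "--xla_gpu_per_fusion_autotune_cache_dir=/tmp/xla_autotune",
])

print(os.environ["XLA_FLAGS"])
```

The Python bindings added by this PR would let these options be set programmatically instead of through the environment variable.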
Run build cleaner tooling on StableHLO
#sdy remove IdentityOp as it's no longer needed.
This is the second part of #15092. NOTE: this feature relies on cudnn-frontend v1.6.1, which is not in XLA yet.
Host Offloading: Process "MoveToHost" instructions in the order they are executed. This ensures we process "MoveToHost" instructions that reside at the beginning of a host-memory offload chain....
Divides the solver timeout budget equally across all mesh shapes and partial mesh shapes, instead of allowing each invocation to consume the full timeout budget.
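The budgeting change above amounts to a simple division of the total timeout across invocations. A minimal sketch, with a hypothetical helper name and example numbers (not taken from the PR):

```python
def per_invocation_timeout(total_timeout_s: float, num_mesh_shapes: int) -> float:
    """Split the solver's total timeout budget equally across mesh shapes.

    Previously each invocation could consume the full budget, so solving
    N mesh shapes could take up to N * total_timeout_s in the worst case.
    """
    if num_mesh_shapes <= 0:
        raise ValueError("need at least one mesh shape")
    return total_timeout_s / num_mesh_shapes

# Example: a 600 s budget over 4 mesh shapes gives 150 s per invocation.
print(per_invocation_timeout(600.0, 4))  # → 150.0
```

This bounds the worst-case total solve time at roughly `total_timeout_s`, at the cost of giving each individual mesh shape less time.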
Allow custom call computations to contain subcomputations
Expose stablehlo version through the PJRT C API.
[XLA:MSA] Added flags to enable/disable async copy and async slice replacements in memory space assignment. Both features are enabled by default.