xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
Reverts b5c9e8b945cea7169b828b0760bef501ae7c8d6f
Enable fusing effective-scalar dynamic-slice instructions into dynamic-update-slice (DUS).
For the current fp8 gemm, we set c_scale to one, though it is effectively never used. Newer cuBLASLt, however, has a stricter requirement that c_scale can be set only when...
Allow fusing epilogues whose operands are broadcasts of effective-scalar instructions. This enables creating fusions for fp8 where the pattern is `mul(dot, scalar_ops)` and the scalar ops' shapes are either [] or...
When using `--xla_gpu_enable_nccl_comm_splitting=true`, it is possible for a deadlock to occur if one or more subgroups of a split was already created and those devices reuse it from the clique...
Add a version of CreateBuffersForAsyncHostToDevice that takes a custom layout.
Add utility function for determining collectives that are not inside custom fusions.
The added `Result` structure will be used to add support for slicing.
In `bazel_query.yml`, query for `deps(//xla/...)` instead, consistent with https://github.com/tensorflow/tensorflow/blob/master/ci/official/utilities/code_check_full.bats#L312
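
Several of the entries above refer to "effective-scalar" instructions. As a rough illustration only, the sketch below assumes this means an array shape that is either rank 0 or has every dimension of size 1, so it holds exactly one element. The helper `is_effective_scalar` is a hypothetical name used for this example; it is not XLA's API and is not taken from any of the commits listed here.

```python
from typing import Sequence


def is_effective_scalar(dims: Sequence[int]) -> bool:
    """Return True if a shape is rank 0 or every dimension has size 1.

    Hypothetical helper for illustration; the definition is an assumption.
    """
    # all() over an empty sequence is True, which covers the rank-0 shape [].
    return all(d == 1 for d in dims)


# Print the result for a few example shapes.
for shape in ([], [1], [1, 1], [4], [1, 8]):
    print(shape, is_effective_scalar(shape))
```

Under this assumption, `[]`, `[1]`, and `[1, 1]` qualify, while `[4]` and `[1, 8]` do not, since they hold more than one element.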