xla issues

Reverts b5c9e8b945cea7169b828b0760bef501ae7c8d6f

copybara-service[bot]

Enable effective scalar dynamic slice fuse into DUS.

copybara-service[bot]

[NVIDIA] Don't use c_scale when the operand c is non-fp8

1

The current fp8 GEMM sets c_scale to one, even though it is effectively never used. Newer cuBLASLt versions, however, enforce a stricter requirement that c_scale can be set only when...

kaixih

Allow fusing epilogues whose operands are broadcast of effective-scalar instructions.

5

Allow fusing epilogues whose operands are broadcasts of effective-scalar instructions. This enables creating fusions for fp8 where the pattern is `mul(dot, scalar_ops)` and the scalar ops' shapes are either [] or...

elfiegg

[XLA:GPU] Add participating groups to NCCL clique key to fix split hang

1

When using `--xla_gpu_enable_nccl_comm_splitting=true`, a deadlock can occur if one or more subgroups of a split were already created and those devices reuse it from the clique...

trevor-m
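
The idea in the entry above, making the participating groups part of the clique key, can be illustrated with a small sketch. This is not XLA's actual clique key or cache: `CliqueKey`, `make_key`, and `CliqueCache` are hypothetical names, and the communicator is stood in for by a string. It only shows that once the groups are part of the key, a subgroup created on its own and the same subgroup created as part of a split no longer collide in the cache.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple

Group = Tuple[int, ...]


@dataclass(frozen=True)
class CliqueKey:
    """Hypothetical cache key: device ids plus the groups taking part in the split."""
    devices: Group
    participating_groups: FrozenSet[Group] = frozenset()


def make_key(devices: Iterable[int],
             participating_groups: Iterable[Iterable[int]] = ()) -> CliqueKey:
    # Normalize to hashable, order-insensitive containers for the groups.
    groups = frozenset(tuple(g) for g in participating_groups)
    return CliqueKey(tuple(devices), groups)


class CliqueCache:
    """Hypothetical cache that hands out one communicator label per distinct key."""

    def __init__(self) -> None:
        self._cliques: Dict[CliqueKey, str] = {}

    def acquire(self, key: CliqueKey) -> str:
        if key not in self._cliques:
            self._cliques[key] = f"comm-{len(self._cliques)}"
        return self._cliques[key]


cache = CliqueCache()

# Keyed on devices alone, every request for devices (0, 1) maps to the same entry.
standalone = cache.acquire(make_key([0, 1]))
standalone_again = cache.acquire(make_key([0, 1]))

# With the participating groups in the key, the same devices requested as part
# of a split over {(0, 1), (2, 3)} get their own entry instead of reusing the old one.
from_split = cache.acquire(make_key([0, 1], [[0, 1], [2, 3]]))
```

Whether a fresh entry is the right behavior in every case depends on details that the truncated summary does not show; the sketch only demonstrates how the extra key field separates the two situations.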

Add a version of CreateBuffersForAsyncHostToDevice that takes a custom layout.

copybara-service[bot]

Add utility function for determining collectives that are not inside custom fusions.

copybara-service[bot]

[GPU][NFC] Refactor cuDNN fusion compiler.

The added `Result` structure will be used to add support for slicing.

sergachev

In `bazel_query.yml` instead query for `deps(//xla/...)`

In `bazel_query.yml`, query for `deps(//xla/...)` instead. This is consistent with https://github.com/tensorflow/tensorflow/blob/master/ci/official/utilities/code_check_full.bats#L312

copybara-service[bot]

[XLA:GPU] remove some redundant mutex lock in fusion analysis cache invalidation

1

Cjkkkk
