pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Environment variable PYTORCH_MIOPEN_SUGGEST_NHWC=1 enables MIOpen batchnorm for NHWC
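As a minimal sketch of how that variable would be used: it must be set in the environment before PyTorch initializes MIOpen, for example at the top of a training script. The `torch` usage in the comments is illustrative and assumes a ROCm build of PyTorch.

```python
import os

# Opt in to MIOpen's NHWC (channels_last) batchnorm path; the variable
# name comes from the entry above. Set it before importing torch.
os.environ["PYTORCH_MIOPEN_SUGGEST_NHWC"] = "1"

# With a ROCm PyTorch build, one would then run the model in
# channels_last memory format so the NHWC kernels are actually hit:
#   import torch
#   model = model.to(memory_format=torch.channels_last)
#   x = x.to(memory_format=torch.channels_last)
print(os.environ["PYTORCH_MIOPEN_SUGGEST_NHWC"])
```

Equivalently, it can be set on the command line: `PYTORCH_MIOPEN_SUGGEST_NHWC=1 python train.py`.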
Reverts the AMAX workaround now that hipblasLT supports AMAX. hipblasLT does not accept a nullptr for scale D, so we create a dummy scalar tensor with the value 1.0...
### 🐛 Describe the bug Hi, I profiled the generation of text with the Mistral 7b LLM on my MI100 GPU and saw that some gemv fp16 kernels don't seem...
### 🐛 Describe the bug Hi, When doing text generation with Mistral 7b with Hugging Face transformers on a MI100 GPU, I can see in the collected torch trace that a...
### 🚀 The feature, motivation and pitch Enable support for Flash Attention, Memory Efficient, and SDPA kernels for AMD GPUs. At present, using these gives the below warning with the latest nightlies...
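For context on what those kernels compute: SDPA is scaled dot-product attention, softmax(QKᵀ/√d)·V. A minimal pure-Python sketch of the math follows (a reference implementation only, not the fused Flash Attention or Memory Efficient kernel, and independent of PyTorch):

```python
import math

def scaled_dot_product_attention(q, k, v):
    """Reference SDPA over plain lists of row vectors.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    Returns an (n_q, d_v) list of output rows.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # Attention scores: q . k / sqrt(d) for every key row.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output row is the attention-weighted sum of value rows.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

The fused kernels requested above compute exactly this, but tiled so the full score matrix never materializes in memory.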
### 🐛 Describe the bug The quick start example from the fastai course no longer runs. I get the following error when running [quickstart.py](https://gist.github.com/briansp2020/92c56be60caeb2e497eaa275b85e13fb). It used to run fine when I...
add std and cub::BLOCK_LOAD_WARP_TRANSPOSE Jira ID: [SWDEV-458189](https://ontrack-internal.amd.com/browse/SWDEV-458189) Fixes #ISSUE_NUMBER
Pushing to our internal fork. Already merged upstream here: https://github.com/pytorch/pytorch/pull/123275
They were disabled in AOTriton V1, but V2 should fix most of them. Passed with:
```
PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_meta.py -k flash_attention -v
PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_ops.py -k flash_attention -v
...
```