AMDMIGraphX
AMD's graph optimization engine.
This issue has two parts. The first part is to fuse reductions (including split reductions) with MLIR, including any pointwise ops. The second part is to use multiple outputs when fusing, to...
Update our [Dockerfile](https://github.com/ROCm/AMDMIGraphX/blob/develop/Dockerfile) and [hip-clang.docker](https://github.com/ROCm/AMDMIGraphX/blob/develop/hip-clang.docker). Additional files may also need updating: https://github.com/ROCm/AMDMIGraphX/tree/develop/tools/docker
Add weight streaming to allow running of large models on GPUs with low memory. Closes #3156.
```
@404 = gpu::code_object[code_object=6464,symbol_name=mlir_convolution_add,global=102400,local=256,](@402,@293,@400,@403) -> half_type, {1, 255, 80, 80}, {1632000, 6400, 80, 1}, target_id=0: 0.0226192ms, 1%
@405 = reshape_lazy[dims={1, 3, 85, 80, 80}](@404) -> half_type, {1, 3, 85, 80,...
```
Figure out a way to support weight streaming at runtime, i.e. be able to fit large models on the GPU without needing to know literal sizes ahead of time. - [x]...
There appears to be an occasional issue in which we try to allocate a buffer on the GPU whose size seems to be an overflow of a UInt64. @kahmed10 has reportedly...
```
@440 = gpu::code_object[code_object=9544,symbol_name=concat_kernel,global=714000,local=1024,](@439,@436,@437,@438) -> half_type, {1, 25200, 85}, {2142000, 85, 1}, target_id=0: 0.0305542ms, 2%
main:#output_0 = @param:main:#output_0 -> float_type, {1, 25200, 85}, {2142000, 85, 1}, target_id=0: 0.0007664ms, 1%
@442...
```
Due to the use of `__syncthreads` in the `reduce` methods, registers are not reused. We can reuse them directly by assigning to them with `r.inner([](auto& y, auto x) { y...