AMDMIGraphX
AMD's graph optimization engine.
This issue has two parts. The first part is to fuse reductions (including split reductions) with MLIR, including any pointwise ops. The second part is to use multiple outputs when fusing, to...
Update our [Dockerfile](https://github.com/ROCm/AMDMIGraphX/blob/develop/Dockerfile) and [hip-clang.docker](https://github.com/ROCm/AMDMIGraphX/blob/develop/hip-clang.docker). Additional files may also need updating: https://github.com/ROCm/AMDMIGraphX/tree/develop/tools/docker
Add weight streaming to allow running of large models on GPUs with low memory. Closes #3156.
```
@404 = gpu::code_object[code_object=6464,symbol_name=mlir_convolution_add,global=102400,local=256,](@402,@293,@400,@403) -> half_type, {1, 255, 80, 80}, {1632000, 6400, 80, 1}, target_id=0: 0.0226192ms, 1%
@405 = reshape_lazy[dims={1, 3, 85, 80, 80}](@404) -> half_type, {1, 3, 85, 80,...
```
Figure out a way to support weight streaming at runtime, i.e. be able to fit large models on the GPU without needing to know literal sizes ahead of time. - [x]...
There appears to be an occasional issue in which we try to allocate a buffer on the GPU whose size seems to be an overflow of a UInt64. @kahmed10 has reportedly...
```
@440 = gpu::code_object[code_object=9544,symbol_name=concat_kernel,global=714000,local=1024,](@439,@436,@437,@438) -> half_type, {1, 25200, 85}, {2142000, 85, 1}, target_id=0: 0.0305542ms, 2%
main:#output_0 = @param:main:#output_0 -> float_type, {1, 25200, 85}, {2142000, 85, 1}, target_id=0: 0.0007664ms, 1%
@442...
```
Due to the use of `__syncthreads` in the `reduce` methods, registers are not reused. We can reuse them directly by assigning to them with `r.inner([](auto& y, auto x) { y...