carlushuang

Results 13 issues of carlushuang

need to generalize code generation logic for different direction, precision, arch
* global load/store:
  - [ ] support different precision, fp32/fp16(short)/ubyte
  - [ ] support 2d/3d load, and have exec...

- [x] unify fwd/bwd/wrw direction branch code in conv_driver.cpp
- [ ] unify fp32/fp16/int8 logic
- [x] unify fwd/bwd/wrw driver code

```
./bin/MIOpenDriver poolfp16 -M 0 -n 32 -c 192 -H 27 -W 27 -y 3 -x 3 -p 0 -q 0 -v 2 -u 2 -m max -F 1 -t...
```

enhancement
quality
testing

Tested on a local gfx908 machine and a server in the US, with rocm-3.10/rocm-4.2, using the latest develop commit 120289fcb33496db05518b24e3a39db01d5adb5c, built with the following steps:
```
mkdir build && cd build
CXX=/opt/rocm/llvm/bin/clang++ cmake -DMIOPEN_TEST_HALF=On -DMIOPEN_TEST_GFX908=On...
```

urgency_unknown

- [x] add test topk
- [x] add example topk-softmax
- [x] add test tile_reduce
- [x] add test scatter-gather
- [x] add tensor transform support for scatter-gather
- [x]...

- [x] add test topk
- [x] add test topk-softmax
- [x] add test tile_reduce
- [x] add test scatter-gather
- [x] add tensor transform support for scatter-gather
- [x]...

This issue tracks the problems encountered while developing avx2 CK.
1. CPU-only compile. Many headers include `hip_runtime.h` and use the `__device__` / `__host__` symbols to describe host/device code....

The moe_sorting kernel produces a compute error when num_tokens > 13K. It is reproducible in aiter, but cannot be reproduced in tile_example_moe_sorting.