ghostplant
Hi @whchung, `HIP_VISIBLE_DEVICES=-1` is not what I'm after. I want to test how the HIP data path uses `libmcwamp_cpu.so` when an AMD GPU (or `libmcwamp_hsa.so`) is not available, because I think `hcc`...
@whchung So do you mean `libmcwamp_cpu.so` is actually not useful?
@whchung Is there a user example showing what CPU mode (`libmcwamp_cpu.so`) is used for? Thanks!
One GPU per machine? Can you explain how many machines you'd like to run it on? Or do you just want to run it with 1 GPU on 1 machine?
If you run it on a one-GPU machine, it seems you need to make sure that GPU has enough memory to store the parameters of all 32 experts. The way to convert `swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth`...
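Just in case it helps, here is a rough way to estimate whether the parameters alone would fit; this is a minimal sketch, and the `'model'` key / checkpoint layout are assumptions you may need to adjust:

```python
import torch

# Load the MoE checkpoint on CPU and estimate how much GPU memory its
# parameters alone would need (optimizer states / activations excluded).
ckpt = torch.load(
    'swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth',
    map_location='cpu')

# The top-level key layout is an assumption; adjust to the actual checkpoint.
state = ckpt.get('model', ckpt)

total_bytes = sum(t.numel() * t.element_size()
                  for t in state.values() if torch.is_tensor(t))
print('parameters: %.2f GiB' % (total_bytes / 2 ** 30))
```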
According to bandwidth profiling, there is no speed difference between `ncclInt8 x N` and `ncclInt32 x N / 4`, so you can choose either.
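For reference, here is a sketch showing the two options side by side with `torch.distributed` (assuming the script is launched with `torchrun` so an NCCL process group can be created); only the element type/count bookkeeping differs, the bytes moved over the wire are the same:

```python
import torch
import torch.distributed as dist

# Assumes launch via torchrun, which sets the rank/world-size env vars.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

n = 1 << 26  # payload size in bytes, divisible by 4

buf_int8 = torch.empty(n, dtype=torch.int8, device='cuda')
dist.broadcast(buf_int8, src=0)           # ncclInt8  x N

buf_int32 = buf_int8.view(torch.int32)    # same storage reinterpreted, N / 4 elements
dist.broadcast(buf_int32, src=0)          # ncclInt32 x N / 4
```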
You can refer to the implementation here: https://github.com/microsoft/tutel/blob/main/tutel/experts/ffn.py
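In case a quick overview helps before reading the file: the core idea there is a batched two-layer FFN where every local expert owns its own weights. A minimal sketch (the names and shapes below are my own, not the actual signatures in `ffn.py`):

```python
import torch

class ExpertFFN(torch.nn.Module):
    # Minimal batched expert FFN: each local expert has its own pair of
    # weight matrices, and tokens already grouped per expert are processed
    # with one batched matmul per layer. Shapes/names are illustrative only.
    def __init__(self, num_local_experts, model_dim, hidden_dim):
        super().__init__()
        self.w1 = torch.nn.Parameter(
            torch.empty(num_local_experts, model_dim, hidden_dim).normal_(std=0.02))
        self.w2 = torch.nn.Parameter(
            torch.empty(num_local_experts, hidden_dim, model_dim).normal_(std=0.02))

    def forward(self, x):
        # x: [num_local_experts, tokens_per_expert, model_dim]
        h = torch.relu(torch.matmul(x, self.w1))
        return torch.matmul(h, self.w2)
```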
It stores the list of unique destination indices that input tokens will be written to in the following dispatch step.
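A toy illustration of that idea (not Tutel's actual kernel): an exclusive cumsum over the one-hot routing mask gives each token a unique slot inside its chosen expert's buffer, and those slots are the destination indices the dispatch then scatters tokens into:

```python
import torch

# One-hot routing mask: which expert each token goes to.
mask = torch.tensor([[1, 0],    # token 0 -> expert 0
                     [0, 1],    # token 1 -> expert 1
                     [1, 0]])   # token 2 -> expert 0

# Exclusive cumsum per expert column gives each token its slot
# (destination index) inside that expert's buffer.
locations = torch.cumsum(mask, dim=0) - 1
dest_in_expert = (locations * mask).sum(dim=1)
print(dest_in_expert)  # tensor([0, 0, 1])
```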
What about `export FAST_CUMSUM=0` first?
Gotcha, this problem is not from `tutel::cumsum`. Instead, you may have an improper installation of Tutel that only enables CPU support rather than CUDA support. The root cause could be an...
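Before reinstalling, you can quickly confirm whether the environment itself is CPU-only; these are generic PyTorch checks, not Tutel-specific ones:

```python
import torch

# If either check fails, only CPU support was available when Tutel was
# installed or run, so its CUDA path could not be enabled.
print('torch.version.cuda      :', torch.version.cuda)        # None  => CPU-only PyTorch build
print('torch.cuda.is_available :', torch.cuda.is_available())  # False => no usable CUDA device/driver

# If both look fine, reinstalling Tutel in this same environment
# should pick up the CUDA path.
```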