ghostplant
Yes. When running with `num_global_experts < self.world_size`, you will have to handle the case `sharded_count > 1`, which tells you how to partition expert parameters that are distributed across more...
Please follow this example of handling `sharded_count`: https://github.com/microsoft/tutel/blob/main/tutel/experts/llama_ffn.py And here is another end-to-end example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_custom_expert_sharded.py
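In case it helps, below is a minimal sketch of the idea (hypothetical class and parameter names, not the exact Tutel API; please check the linked examples for the real interface): when `sharded_count > 1`, each device only holds a `1 / sharded_count` slice of the expert's intermediate dimension.

```python
# Minimal sketch of sharding an expert FFN across devices (hypothetical names,
# not the exact Tutel API; see the linked llama_ffn.py for the real interface).
import torch


class ShardedExpertFFN(torch.nn.Module):
    def __init__(self, model_dim, hidden_size, local_experts, sharded_count):
        super().__init__()
        assert hidden_size % sharded_count == 0, "hidden size must divide evenly"
        # Each rank keeps only a 1/sharded_count slice of the intermediate dim,
        # so one logical expert is spread over `sharded_count` GPUs.
        sharded_hidden = hidden_size // sharded_count
        self.fc1 = torch.nn.Parameter(torch.empty(local_experts, model_dim, sharded_hidden))
        self.fc2 = torch.nn.Parameter(torch.empty(local_experts, sharded_hidden, model_dim))
        torch.nn.init.normal_(self.fc1, std=0.02)
        torch.nn.init.normal_(self.fc2, std=0.02)

    def forward(self, x):
        # x: [local_experts, tokens, model_dim]
        h = torch.relu(torch.bmm(x, self.fc1))
        y = torch.bmm(h, self.fc2)
        # Each rank produces a partial result over its hidden-dim slice; the
        # reduction across shards happens outside this module (framework-dependent).
        return y
```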
Now I get 11 TFLOPS on a 2080 Ti and 17 TFLOPS on an A100. Is that reasonable?
Hello @thakkarV, when running cutlass_profiler, I found that `*_sptensorop_*` is generally faster than `*_tensorop_*` on a 4Kx4Kx4K GEMM. For example, I get an optimal 860 TFLOPS using `_tensorop_` while getting an optimal...
> sptensorop uses the structured sparse MMA, which is why you see it being faster

Thanks, that's reasonable if part of the GEMM inputs is sparse. But if considering a...
> Sparse GEMM forces structured sparsity. It's a totally different kernel and has implications on your workload characteristics.

OK, does it mean that **fully random GEMM operations (e.g. torch.matmul(x, y))...
Thank you, then it looks like 860 TFLOPS is the peak that CUTLASS can achieve for dense GEMM.
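For what it's worth, a rough sanity check of the dense-GEMM number from Python might look like the sketch below (my own timing setup, not a CUTLASS benchmark): it times an fp16 4096x4096x4096 `torch.matmul` and reports TFLOPS, so it measures whatever dense kernel cuBLAS/PyTorch picks rather than a specific CUTLASS `_tensorop_` kernel.

```python
# Rough dense-GEMM throughput check with torch.matmul (illustrative sketch only).
import torch

m = n = k = 4096
a = torch.randn(m, k, device="cuda", dtype=torch.float16)
b = torch.randn(k, n, device="cuda", dtype=torch.float16)

# Warm up so first-call overhead is not counted.
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time() returns milliseconds
tflops = 2 * m * n * k / seconds / 1e12          # 2*M*N*K FLOPs per dense GEMM
print(f"dense GEMM: {tflops:.1f} TFLOPS")
```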
@yzh119 Is there an option that directly takes head_dim=576, instead of separate q_pe & q_nope?
How can I stop receiving a bunch of notifications from this repo every day? I didn't even know about this repo.
Hi. What you ask about includes the "model-required cost" and the "switching cost". The "model-required cost" is the baseline cost needed to compute the model regardless of whether you switch from another parallel configuration. Usually,...
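One way to write the decomposition down (my own notation, not from the Tutel docs) is sketched below: the per-step cost under a parallel configuration $c$, reached from a previous configuration $c'$, splits into the always-paid model cost and the one-time migration cost.

```latex
% Illustrative notation only: T_model(c) is the "model-required cost" paid on
% every step, and T_switch(c' -> c) is the extra cost of migrating parameters
% and state when the parallel configuration changes from c' to c.
\[
  T_{\text{step}}(c) \;=\; T_{\text{model}}(c) \;+\; T_{\text{switch}}(c' \to c)
\]
```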