ghostplant

272 comments of ghostplant

Yes. When running with `num_global_experts < self.world_size`, you will have to handle `if sharded_count > 1`, which tells how to partition expert parameters that are distributed across more...

Please follow this example for handling `sharded_count`: https://github.com/microsoft/tutel/blob/main/tutel/experts/llama_ffn.py And another end-to-end example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_custom_expert_sharded.py
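
For intuition, here is a minimal sketch of the idea behind those examples. It is not the exact Tutel expert interface; the class name and constructor arguments are simplified for illustration:

```python
# A minimal sketch (not the exact Tutel API) of how an FFN expert might
# partition its parameters when `sharded_count > 1`, i.e. when one expert
# is distributed across several ranks because num_global_experts < world_size.
import torch

class ShardedFFNExpert(torch.nn.Module):
    def __init__(self, model_dim, hidden_dim, sharded_count):
        super().__init__()
        # Each of the `sharded_count` ranks owns a 1/sharded_count slice
        # of the hidden dimension; hidden_dim must divide evenly.
        assert hidden_dim % sharded_count == 0
        local_hidden = hidden_dim // sharded_count
        self.fc1 = torch.nn.Linear(model_dim, local_hidden)
        self.fc2 = torch.nn.Linear(local_hidden, model_dim)

    def forward(self, x):
        # Partial fc2 outputs are summed across the shard group
        # (e.g. via all-reduce) outside this module.
        return self.fc2(torch.relu(self.fc1(x)))
```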

Now I get 11 TFlops on a 2080 Ti and 17 TFlops on an A100. Is that reasonable?
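
For reference, a minimal sketch of how such throughput numbers can be measured: a dense GEMM performs `2*M*N*K` floating-point operations, so dividing by the measured time gives the achieved FLOP rate. The `4096` sizes here are just illustrative:

```python
# Measure achieved TFlops of a dense fp16 GEMM on the GPU.
import time
import torch

M = N = K = 4096
a = torch.randn(M, K, device='cuda', dtype=torch.float16)
b = torch.randn(K, N, device='cuda', dtype=torch.float16)

torch.cuda.synchronize()
t0 = time.time()
steps = 100
for _ in range(steps):
    c = torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = (time.time() - t0) / steps

# 2*M*N*K FLOPs per GEMM, converted to TFlops.
tflops = 2 * M * N * K / elapsed / 1e12
print(f'{tflops:.1f} TFlops')
```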

Hello @thakkarV, when running cutlass_profiler, I found that `*_sptensorop_*` is generally faster than `*_tensorop_*` for a 4Kx4Kx4K GEMM. For example, I get an optimal 860 TFlops using `_tensorop_`, while getting an optimal...

> sptensorop uses the structured sparse MMA, which is why you see it being faster

Thanks, that's reasonable if some areas of the GEMM inputs are sparse. But if considering a...

> Sparse GEMM forces structured sparsity. It's a totally different kernel and has implications on your workload characteristics.

OK, does it mean that a **fully random GEMM operation (e.g. `torch.matmul(x, y)`)...
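
To make the constraint concrete, here is a small illustration of the 2:4 structured-sparsity pattern that sparse tensor cores require (the check function is hypothetical, written just for this sketch):

```python
# 2:4 structured sparsity: in every contiguous group of 4 elements along
# a row, at most 2 may be non-zero. A fully random dense matrix almost
# never satisfies this, so it cannot take the sparse MMA path.
import torch

def is_2_to_4_sparse(mat: torch.Tensor) -> bool:
    # Reshape each row into groups of 4 and count non-zeros per group.
    groups = mat.reshape(mat.shape[0], -1, 4)
    return bool(((groups != 0).sum(dim=-1) <= 2).all())

dense = torch.randn(128, 128)
print(is_2_to_4_sparse(dense))          # False: random dense data

pruned = dense.clone().reshape(128, -1, 4)
# Zero out the two smallest-magnitude entries in each group of 4.
idx = pruned.abs().argsort(dim=-1)[..., :2]
pruned.scatter_(-1, idx, 0.0)
print(is_2_to_4_sparse(pruned.reshape(128, 128)))  # True after 2:4 pruning
```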

Thank you. Then it looks like 860 TFlops is the peak that CUTLASS can achieve for dense GEMM.

@yzh119 Is there an option that directly takes `head_dim=576`, instead of separate `q_pe` & `q_nope`?
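
A minimal sketch of what such an option would amount to, assuming the MLA layout where the 576-dim query head is just the 512-dim "nope" part concatenated with the 64-dim RoPE ("pe") part; the shapes below are illustrative:

```python
# If head_dim=576 = 512 (q_nope) + 64 (q_pe), a fused entry point would be
# equivalent to concatenating the two parts before the attention call.
import torch

num_tokens, num_heads = 8, 16
q_nope = torch.randn(num_tokens, num_heads, 512)
q_pe   = torch.randn(num_tokens, num_heads, 64)

q = torch.cat([q_nope, q_pe], dim=-1)
print(q.shape)  # torch.Size([8, 16, 576])
```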

How can I stop receiving a bunch of notifications from this repo every day? I didn't even know this repo existed.

Hi. What you ask about includes the "model-required cost" and the "switching cost". The "model-required cost" is the baseline cost needed to compute the model, regardless of whether you switched from another parallel configuration. Usually,...
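
A minimal sketch of how the two costs interact (the function and numbers are hypothetical, just to show the trade-off): switching to a new parallel configuration pays a one-time switching cost, which is only worthwhile if the per-step saving in model-required cost amortizes it over the remaining steps.

```python
# Decide whether a reconfiguration pays off: the total saving from a lower
# per-step (model-required) cost must exceed the one-time switching cost.
def worth_switching(cur_step_cost, new_step_cost, switch_cost, remaining_steps):
    saving = (cur_step_cost - new_step_cost) * remaining_steps
    return saving > switch_cost

# E.g. saving 2 ms/step over 1000 steps amortizes a 500 ms reconfiguration.
print(worth_switching(10e-3, 8e-3, 0.5, 1000))  # True
```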