ghostplant
Yes. When running with `num_global_experts < self.world_size`, you will have to handle the case `sharded_count > 1`, which tells you how to partition expert parameters that are distributed across more...
Please follow this example of handling `sharded_count`: https://github.com/microsoft/tutel/blob/main/tutel/experts/llama_ffn.py And here is another end-to-end example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_custom_expert_sharded.py
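In case it helps, below is a minimal sketch of the idea (hypothetical class and parameter names, not the exact Tutel API; please check the linked examples for the real interface): when `sharded_count > 1`, each device only holds a `1 / sharded_count` slice of the expert's intermediate dimension.

```python
# Minimal sketch of sharding an expert FFN across devices (hypothetical names,
# not the exact Tutel API; see the linked llama_ffn.py for the real interface).
import torch


class ShardedExpertFFN(torch.nn.Module):
    def __init__(self, model_dim, hidden_size, local_experts, sharded_count):
        super().__init__()
        assert hidden_size % sharded_count == 0, "hidden size must divide evenly"
        # Each rank keeps only a 1/sharded_count slice of the intermediate dim,
        # so one logical expert is spread over `sharded_count` GPUs.
        sharded_hidden = hidden_size // sharded_count
        self.fc1 = torch.nn.Parameter(torch.empty(local_experts, model_dim, sharded_hidden))
        self.fc2 = torch.nn.Parameter(torch.empty(local_experts, sharded_hidden, model_dim))
        torch.nn.init.normal_(self.fc1, std=0.02)
        torch.nn.init.normal_(self.fc2, std=0.02)

    def forward(self, x):
        # x: [local_experts, tokens, model_dim]
        h = torch.relu(torch.bmm(x, self.fc1))
        y = torch.bmm(h, self.fc2)
        # Each rank produces a partial result over its hidden-dim slice; the
        # reduction across shards happens outside this module (framework-dependent).
        return y
```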
Now I get 11 TFLOPS on a 2080 Ti and 17 TFLOPS on an A100. Is that reasonable?
Hello @thakkarV, when running cutlass_profiler, I found that `*_sptensorop_*` is generally faster than `*_tensorop_*` on a 4Kx4Kx4K GEMM. For example, I get an optimal 860 TFLOPS using `_tensorop_` while getting an optimal...
> sptensorop uses the structured sparse MMA, which is why you see it being faster

Thanks, that's reasonable if part of the GEMM inputs is sparse. But if considering a...
> Sparse GEMM forces structured sparsity. It's a totally different kernel and has implications on your workload characteristics.

OK, does it mean that **fully random GEMM operations (e.g. torch.matmul(x, y))...
Thank you, then it looks like 860 TFLOPS is the peak that CUTLASS can achieve for dense GEMM.
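For what it's worth, a rough sanity check of the dense-GEMM number from Python might look like the sketch below (my own timing setup, not a CUTLASS benchmark): it times an fp16 4096x4096x4096 `torch.matmul` and reports TFLOPS, so it measures whatever dense kernel cuBLAS/PyTorch picks rather than a specific CUTLASS `_tensorop_` kernel.

```python
# Rough dense-GEMM throughput check with torch.matmul (illustrative sketch only).
import torch

m = n = k = 4096
a = torch.randn(m, k, device="cuda", dtype=torch.float16)
b = torch.randn(k, n, device="cuda", dtype=torch.float16)

# Warm up so first-call overhead is not counted.
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time() returns milliseconds
tflops = 2 * m * n * k / seconds / 1e12          # 2*M*N*K FLOPs per dense GEMM
print(f"dense GEMM: {tflops:.1f} TFLOPS")
```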
@yzh119 Is there an option that directly takes head_dim=576, instead of separate q_pe & q_nope?
How can I stop receiving a bunch of notifications from this repo every day? I didn't even know about this repo.
Hi. What you ask about includes the "model-required cost" and the "switching cost". The "model-required cost" is the baseline cost needed to compute the model regardless of whether you switch from another parallel configuration. Usually,...
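One way to write the decomposition down (my own notation, not from the Tutel docs) is sketched below: the per-step cost under a parallel configuration $c$, reached from a previous configuration $c'$, splits into the always-paid model cost and the one-time migration cost.

```latex
% Illustrative notation only: T_model(c) is the "model-required cost" paid on
% every step, and T_switch(c' -> c) is the extra cost of migrating parameters
% and state when the parallel configuration changes from c' to c.
\[
  T_{\text{step}}(c) \;=\; T_{\text{model}}(c) \;+\; T_{\text{switch}}(c' \to c)
\]
```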