ghostplant
Any suggestions? A thousand-line kernel source file takes hours to compile.
@b-sumner Usually, a TVM engine compiling a large model generates thousand-line source files containing many sub-kernels. It takes 3-4 hours to compile such files, while nvcc takes...
@b-sumner clang-9 from ROCm compiles sequentially, while nvcc compiles source files using multiple processes.
@b-sumner I think the problem is caused by putting too many `__global__` kernels within one source file. It is possible to split them into separate standalone source files, with only...
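For illustration only, here is a rough sketch of that workaround: a script that splits a generated multi-kernel source file into one file per `__global__` kernel and compiles the pieces concurrently. The file names, the `hipcc` flags, and the naive splitting heuristic are assumptions for this sketch, not code taken from TVM or tutel.

```python
# Hedged sketch: split a generated multi-kernel .cu file into one file per
# `__global__` kernel, then compile the pieces in parallel. Paths, compiler
# name, and flags are illustrative assumptions.
import re
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

SOURCE = Path("generated_kernels.cu")          # hypothetical generated file
OUT_DIR = Path("split_kernels")
OUT_DIR.mkdir(exist_ok=True)

text = SOURCE.read_text()
# Naive split: assume the file is a shared preamble followed by `__global__` kernels.
parts = re.split(r"(?=__global__\s)", text)
preamble, kernels = parts[0], parts[1:]

files = []
for i, body in enumerate(kernels):
    f = OUT_DIR / f"kernel_{i}.cu"
    f.write_text(preamble + body)              # each piece keeps the shared preamble
    files.append(f)

def compile_one(src: Path) -> int:
    # one compiler process per kernel file, instead of one sequential clang invocation
    return subprocess.call(["hipcc", "-c", str(src), "-o", str(src.with_suffix(".o"))])

with ProcessPoolExecutor() as pool:            # compile the pieces concurrently
    results = list(pool.map(compile_one, files))

print("all succeeded" if all(r == 0 for r in results) else "some compilations failed")
```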
You can follow this example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py, which can be executed with: `python3 -m tutel.examples.helloworld_demo --batch_size=16`
Is that a static parameter that can be set just in the `__init__` function of `CustomExpertDemo`?
It still needs a few API upgrades to meet your requirement.
You need to feed the extra argument data you need here: https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L238, where `self.experts` is the layer object created from your custom `CustomExpertDemo`. You also need to extend the corresponding argument list...
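As a rough illustration only (the exact tutel signatures may differ): a minimal sketch of a custom expert whose `forward` accepts one extra tensor, plus a comment showing where the MoE layer's call site linked above would thread it through. The `extra_data` name and the `forward(x, ctx)` shape are assumptions, not tutel's documented API.

```python
# Hedged sketch (names and signatures are assumptions, not tutel's documented API):
# a custom expert whose forward() accepts one extra tensor.
import torch

class CustomExpertDemo(torch.nn.Module):
    def __init__(self, model_dim):
        super().__init__()
        self.fc = torch.nn.Linear(model_dim, model_dim)

    def forward(self, x, ctx=None, extra_data=None):
        # `extra_data` is the per-call argument forwarded from the MoE layer;
        # here it is simply added as a bias-like term for illustration.
        y = self.fc(x)
        if extra_data is not None:
            y = y + extra_data
        return y

# Inside the MoE layer's forward (around the linked line in moe_layer.py), the
# call to the expert would be extended to pass the argument through, roughly:
#     expert_output = self.experts(dispatched_input, ctx, extra_data=extra_data)
# after adding `extra_data` to the MoE layer's own forward(...) argument list.
```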
> When I use the Customexpert, it stopped here:
>
> ```python
> if ctx.sharded_count > 1:
>     raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")
> ```
> ...
So it looks like `num_global_experts` is smaller than the number of GPUs, right?
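To make the check concrete, here is a sketch of the arithmetic I believe is behind that exception (an assumption for illustration, not a quote of tutel's source): when there are fewer global experts than GPUs, each expert is sharded across several GPUs, so `sharded_count` exceeds 1 and the custom expert raises.

```python
# Hedged sketch of the assumed relationship: fewer global experts than GPUs
# implies each expert is split across several GPUs, i.e. sharded_count > 1.
def sharded_count(world_size: int, num_global_experts: int) -> int:
    # When experts outnumber (or match) the GPUs, every GPU holds whole experts.
    if num_global_experts >= world_size:
        return 1
    # Otherwise one expert is split across world_size / num_global_experts GPUs.
    return world_size // num_global_experts

assert sharded_count(world_size=8, num_global_experts=8) == 1   # no sharding
assert sharded_count(world_size=8, num_global_experts=2) == 4   # triggers the exception path
```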