ZincCat
Maybe we also need to change the docs/main README.
Just fixed it, thanks!
It seems the original kernel is tied to GPT-OSS. I've made it work for Qwen3, but DeepSpeed seems to be causing trouble when I try to merge the experts...
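For context, "merging the experts" here means stacking the per-expert `nn.Linear` weights into single 3-D tensors that a grouped GEMM can consume in one call. A rough sketch of that copy (the `gate_proj`/`up_proj`/`down_proj` attribute names follow the HF Qwen3-MoE layout; the function itself is illustrative, not the actual MegaBlocks API):

```python
import torch

@torch.no_grad()
def merge_expert_weights(experts):
    # experts: per-expert MLP modules, each holding gate_proj / up_proj /
    # down_proj nn.Linear layers (Qwen3-MoE style). Stack them into
    # (num_experts, out_features, in_features) tensors.
    w_gate = torch.stack([e.gate_proj.weight for e in experts])
    w_up = torch.stack([e.up_proj.weight for e in experts])
    w_down = torch.stack([e.down_proj.weight for e in experts])
    return w_gate, w_up, w_down
```

If the DeepSpeed trouble is ZeRO-3 partitioning, a naive stack like this only sees empty shards; gathering the parameters first (e.g. with `deepspeed.zero.GatheredParameters`) might help.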
Sure, I'll provide my version later.
You can refer to the commits in https://github.com/zinccat/qwen3_moe_megablocks.
It's either `self.weight = nn.Parameter(torch.empty(config.num_experts, config.hidden_size, dtype=torch.bfloat16))`, or making the class a subclass of `nn.Linear(config.hidden_size, config.num_experts)`, as specified in your reference.
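Concretely, the two options would look something like this (a sketch; the `config` fields match the message above, the class names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 1: a bare parameter of shape (num_experts, hidden_size).
class Router(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.weight = nn.Parameter(
            torch.empty(config.num_experts, config.hidden_size, dtype=torch.bfloat16)
        )

    def forward(self, x):
        # (num_tokens, hidden_size) -> (num_tokens, num_experts) router logits
        return F.linear(x, self.weight)

# Option 2: subclass nn.Linear(hidden_size, num_experts) directly.
class LinearRouter(nn.Linear):
    def __init__(self, config):
        super().__init__(config.hidden_size, config.num_experts, bias=False)
```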
I'm currently using a similar approach.
It's quite simple: just strip the `_checkpoint_wrapped_model` part from the model weights' keys.
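Something like this sketch (the helper name is mine; the default prefix below is the one the stock PyTorch checkpoint wrapper inserts, so swap in whatever segment actually appears in your keys):

```python
def strip_wrapper_prefix(state_dict, prefix="_checkpoint_wrapped_module."):
    # Drop the activation-checkpointing wrapper segment from every key, e.g.
    # "model.layers.0._checkpoint_wrapped_module.mlp.gate.weight"
    #   -> "model.layers.0.mlp.gate.weight"
    return {k.replace(prefix, ""): v for k, v in state_dict.items()}
```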