Daniel Stokes
Hi @akhoroshev, first off, thanks for the contribution. I agree with @nv-guomingz about having this be a separate model, but also that this is something we could handle separately after...
> I agree that the is_moe_layer function is better. But what about dense_intermidiate_size param? It's ok or we need more general solution?

This is a good question, perhaps a list...
Thanks @akhoroshev, that makes perfect sense to me. Feel free to make that change to this PR if you would like. I discussed re shared experts, and the verdict was...
Hi @Ahmad-Magdy-Osman, currently these changes are being tested on our internal branch. Once they are accepted internally they will be released in one of our upcoming weekly releases. We will...
This LGTM, thanks @jinyangyuan-nvidia. One bigger change would be, with this sort of approach, I think it would be good to consider if we could couple this with the DP...