faresobeid
Thank you!
Sorry to reopen this issue, but I have been having some stability issues at scale with MLA. As I said before, I am using a hybrid model and therefore...
Yes. Stability has been fine without the inner RMS norm, but any recommendations would still be helpful.
> why you choose FFN 3.5x instead of 3x?

Because removing the SiLU gate saves params, which can be put into the FFN.
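For anyone who wants to check the budget themselves, here is a rough sketch of the arithmetic. Everything in it is an assumption for illustration: the helper name `reallocated_ffn_multiplier` is hypothetical, the FFN is assumed to use three bias-free `d_model x (mult * d_model)` matrices, and the saved-parameter figure is a made-up number picked only so the result lands on 3.5x. It is not the actual gate shape from this model.

```python
def reallocated_ffn_multiplier(d_model, base_mult, saved_params, n_ffn_mats=3):
    """Return the FFN hidden-size multiplier after moving saved params into the FFN.

    Assumption: each of the n_ffn_mats FFN matrices has shape
    d_model x (mult * d_model) and there are no biases.
    """
    # Parameters added per 1.0 increase of the multiplier.
    params_per_unit_mult = n_ffn_mats * d_model * d_model
    return base_mult + saved_params / params_per_unit_mult


d = 2048
# Hypothetical: suppose dropping the gate frees 1.5 * d^2 params per layer.
saved = 1.5 * d * d
print(reallocated_ffn_multiplier(d, 3.0, saved))  # 3.5
```

With a different gate shape or a two-matrix FFN the resulting multiplier changes, so plug in the real per-layer counts before relying on it.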
I think they mention that this is only done for the 1.5B, not the 14B yet. Would defo love to see this merged into verl.