faresobeid

Results 5 comments of faresobeid

Sorry to reopen this issue, but I have been running into stability problems at scale with MLA. Like I said before, I am using a hybrid model and therefore...

Yes, although stability has been fine without the inner RMS norm. Still, any recommendations would be helpful.

> why you choose FFN 3.5x instead of 3x?

Because removing the SiLU gate saves params, which can be put into the FFN.

I think they mention that this has only been done for the 1.5B so far, not the 14B. Would defo love to see this merged into verl.