Jani Monoses
@huybery could it be due to differences in how SiLU/SwiGLU is implemented in OLMoE versus the existing FusedMoE module?
The RMSNorm outputs differ. Fixing that should correct at least some of the discrepancy between the two models' attention outputs. It can be seen by swapping forward_native for forward_cuda in...
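To illustrate the kind of check I mean (a standalone NumPy sketch, not the actual forward_native/forward_cuda kernels), you can compare a reference RMSNorm against a second code path and look at the maximum element-wise difference:

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    # Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight,
    # with the variance accumulated in float64 for accuracy.
    variance = np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True)
    return ((x / np.sqrt(variance + eps)) * weight).astype(x.dtype)

def rmsnorm_alt(x, weight, eps=1e-6):
    # A second implementation that stays in the input dtype,
    # standing in for an alternative (e.g. fused) code path.
    variance = np.mean(x ** 2, axis=-1, keepdims=True)
    return (x / np.sqrt(variance + eps)) * weight

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)
w = np.ones(8, dtype=np.float32)

# A large max-abs-diff here would point at the normalization layer
# rather than at the attention code that consumes its output.
diff = float(np.max(np.abs(rmsnorm_ref(x, w) - rmsnorm_alt(x, w))))
print(f"max abs diff: {diff:.2e}")
```

If the two paths disagree beyond float32 rounding noise, the divergence originates in the norm, not downstream.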
The weight_loader should be passed name, not weight_name; otherwise it silently fails to load the weights in the MoE layer and its output is all zeros. This is the diff...
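A minimal sketch of why this fails silently (hypothetical key names, not the actual vLLM loader): if the loader looks weights up under a key that never matches, no exception is raised and the parameter keeps its zero initialization.

```python
import numpy as np

# Hypothetical checkpoint and parameter dicts for a single MoE weight.
checkpoint = {"experts.w1": np.ones((4, 4))}
params = {"experts.w1": np.zeros((4, 4))}

def load(params, checkpoint, key):
    # Silently skips keys absent from the checkpoint, like a loader
    # keyed on the wrong name variable.
    if key in checkpoint:
        params[key][...] = checkpoint[key]

load(params, checkpoint, "experts.w1_scale")  # wrong key: nothing happens
assert params["experts.w1"].sum() == 0        # layer output stays all zeros

load(params, checkpoint, "experts.w1")        # correct key: weights land
assert params["experts.w1"].sum() == 16
```

Passing the correct key makes the copy happen; with the wrong one the model runs fine but the MoE layer contributes nothing.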
@mcharytoniuk @LaurentMazare sorry about that, my bad. Looks like I did not test the model I thought I was testing... I hope I will have time to take a look...