The loop should `continue` when a parameter has "expert" in its name but no actual expert is found
I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint saving function.
The warning triggers because the code finds a parameter that looks like an expert (by name) but is not one:
[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.
but since the loop does not `continue`, this error still occurs:
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
A simple `continue` fixes this :)
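For context, here is a minimal sketch of the failing pattern and the proposed fix. The regex, function, and variable names are illustrative, not the exact DeepSpeed source:

```python
import logging
import re

logger = logging.getLogger(__name__)

def collect_expert_params(moe_state_dict):
    """Illustrative sketch: group expert parameters by global expert id,
    skipping keys that merely contain 'expert' somewhere in their name."""
    experts = {}
    for key, param in moe_state_dict.items():
        match = re.search(r"deepspeed_experts\.(\d+)\.", key)
        if match is None:
            # e.g. transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask
            logger.warning(f"No expert found in key {key}.")
            continue  # proposed fix: without this, int(None) raises TypeError below
        expert_id = int(match.group(1))
        experts.setdefault(expert_id, {})[key] = param
    return experts
```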
@stas00, FYI
Converted this to a draft because just `continue` is not sufficient: the parameter is not saved at all in this case, so loading the model again then fails.
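One possible direction (an illustrative sketch only, not the actual DeepSpeed fix): instead of skipping such keys entirely, route them into the regular, non-expert state dict so they still end up in a checkpoint that loading can find:

```python
import re

def split_moe_state_dict(moe_state_dict):
    """Illustrative sketch: keep 'expert'-named parameters that are not
    real experts in the shared state dict instead of dropping them."""
    expert_state, shared_state = {}, {}
    for key, param in moe_state_dict.items():
        if re.search(r"deepspeed_experts\.(\d+)\.", key) is None:
            # not a real expert parameter; save it with the shared weights
            shared_state[key] = param
            continue
        expert_state[key] = param
    return expert_state, shared_state
```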
I haven't gotten to checkpoint saving yet, so I don't have an understanding of this code yet.
It's interesting someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here https://github.com/snowflakedb/ArcticTraining/pull/272 - currently qwen3-moe and qwen3-next are supported - but no checkpoint saving yet... will come later.
@stas00 thanks for the info, I will definitely check it out :) Maybe I can find a fix for my problem here over the weekend.