
If no expert is found in a parameter that has "expert" in its name, the loop should continue

LckyLke opened this issue 1 month ago · 4 comments

I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint saving function.

The warning triggers because the code finds an "expert" (by name) that is not actually one:

[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.

but since the loop does not continue, this error still happens:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

A simple continue fixes this :)
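
For reference, here is a minimal sketch of the failing pattern, assuming the structure of `_save_moe_checkpoint` in engine.py (the regex, prefix, and function name here are approximations, not the verbatim DeepSpeed source):

```python
import logging
import re

logger = logging.getLogger(__name__)

def split_expert_state_dict(moe_state_dict,
                            moe_str_prefix="deepspeed_moe.experts.deepspeed_experts."):
    """Group expert parameters by their local expert id (sketch)."""
    experts_state_dict = {}
    for key in list(moe_state_dict.keys()):
        m = re.match(f".*{moe_str_prefix}([0-9]+).*", key)
        if not m:
            logger.warning(f"No expert found in key {key}.")
            # Proposed fix: skip keys that merely contain "expert" in
            # their name (e.g. "...gate.wg.experts_mask") but are not
            # real expert parameters.
            continue
        # Without the continue above, the code falls through with a
        # None expert id, and int(None) raises:
        # TypeError: int() argument must be a string, a bytes-like
        # object or a real number, not 'NoneType'
        local_expert_id = int(m.group(1))
        experts_state_dict.setdefault(local_expert_id, {})[key] = moe_state_dict[key]
    return experts_state_dict
```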

LckyLke avatar Nov 11 '25 19:11 LckyLke

@stas00, FYI

sfc-gh-truwase avatar Nov 12 '25 19:11 sfc-gh-truwase

Converting this to a draft because just continue is not sufficient: the parameter is then not saved at all, so loading the model again fails.
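
One possible direction (a sketch under the same assumptions as the snippet above, not DeepSpeed's actual fix): instead of dropping the key, route it into the non-expert state dict so it is still written to the regular checkpoint and restored on load:

```python
import re

def split_state_dicts(moe_state_dict,
                      moe_str_prefix="deepspeed_moe.experts.deepspeed_experts."):
    """Separate true expert parameters from parameters that only
    contain "expert" in their name (sketch)."""
    experts_state_dict, non_expert_state_dict = {}, {}
    for key, value in moe_state_dict.items():
        m = re.match(f".*{moe_str_prefix}([0-9]+).*", key)
        if not m:
            # e.g. "...gate.wg.experts_mask": keep it with the dense
            # parameters so it is saved and loaded normally instead of
            # being silently dropped.
            non_expert_state_dict[key] = value
            continue
        expert_id = int(m.group(1))
        experts_state_dict.setdefault(expert_id, {})[key] = value
    return experts_state_dict, non_expert_state_dict
```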

LckyLke avatar Nov 12 '25 19:11 LckyLke

I haven't gotten to checkpoint saving yet, so I don't understand this code yet.

It's interesting that someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here: https://github.com/snowflakedb/ArcticTraining/pull/272 - currently qwen3-moe and qwen3-next are supported, but no checkpoint saving yet... that will come later.

stas00 avatar Nov 12 '25 20:11 stas00

@stas00 thanks for the info, I will definitely check it out :) Maybe I can find a fix for my problem over the weekend.

LckyLke avatar Nov 19 '25 11:11 LckyLke