DeepSpeedExamples step3 answer is not correct

Open BaiStone2017 opened this issue 1 year ago • 5 comments

Settings are as follows: hybrid engine (HE) disabled (HE + ZeRO-2 raises an error), pp_epochs=1, num_train_epochs=1, disable_actor_dropout set, and per_device_train_batch_size and per_device_mini_train_batch_size both set to 2.

[Screenshots: actor loss, inference demo]

The actor_ema model, by contrast, seems normal, but its effect is the same as the SFT model's.

May 25 '23 01:05 BaiStone2017

I ran into the same problem. The SFT model trained in step 1 seems normal and performs well in evaluation, but the model from step 3 generates highly repetitive output at inference time. I found that disabling "enable_hybrid_engine" makes the problem go away, so it may be caused by a bug in the hybrid engine.

Jun 01 '23 16:06 Alexandra9898

Mark

Jun 30 '23 09:06 Junyiliu0

I ran into the same problem, and I found that the EMA model has the same weights as my SFT model, while the actor model was saved correctly with different weights. I guess that's why its effect is the same as the SFT model's.
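For anyone who wants to run the same check, here is a minimal sketch of that kind of weight comparison. It is not taken from DeepSpeed-Chat: `differing_params` is a hypothetical helper, and the three dictionaries are toy stand-ins for the SFT, EMA and actor checkpoints. It assumes each checkpoint has already been converted to a mapping from parameter name to a flat list of floats.

```python
def differing_params(state_a, state_b):
    """Return the sorted names of parameters that differ between two checkpoints.

    Each argument maps a parameter name to a flat list of floats. With PyTorch,
    and assuming the file holds a plain state dict, one way to build it is:
        {k: v.flatten().tolist() for k, v in torch.load(path).items()}
    A name present in only one of the two checkpoints also counts as differing.
    """
    names = set(state_a) | set(state_b)
    return sorted(
        name for name in names
        if state_a.get(name) != state_b.get(name)
    )


# Toy stand-ins for the three checkpoints (hypothetical values).
sft = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
ema = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
actor = {"layer.weight": [0.3, 0.1], "layer.bias": [0.0]}

print(differing_params(sft, ema))    # [] : EMA is identical to SFT
print(differing_params(sft, actor))  # ['layer.weight'] : the actor was updated
```

An empty list for SFT vs. EMA together with a non-empty list for SFT vs. actor would match what I saw: the actor was trained, but the EMA copy never moved away from the SFT weights.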

Nov 07 '23 08:11 SupercarryNg

Nov 10 '23 02:11 Luoxiaohei41

Mark

Nov 14 '23 12:11 zmzhang2000
