Result error

Open xuewenMei opened this issue 8 months ago • 0 comments

Hello, I have resolved the validation issue. The problem is in the deepspeed2hf weight script: it still shards the weights, even after I raised the maximum shard size from 5GB to 30GB. I have refactored this part of the code and will submit a merge request later.
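
For anyone who wants to check whether a converted checkpoint was sharded, here is a minimal sketch. It assumes the output directory follows the usual Hugging Face layout, where a sharded checkpoint comes with a `model.safetensors.index.json` or `pytorch_model.bin.index.json` file containing a `weight_map`. The helper name `list_shards` is hypothetical and is not part of the deepspeed2hf script.

```python
import json
from pathlib import Path

# Index files that Hugging Face transformers writes next to sharded checkpoints.
INDEX_FILES = ("model.safetensors.index.json", "pytorch_model.bin.index.json")


def list_shards(checkpoint_dir):
    """Return the sorted shard file names from the index, or [] if no index exists.

    Hypothetical helper: an empty list means no shard index was found,
    which is what a single-file (unsharded) export looks like.
    """
    root = Path(checkpoint_dir)
    for name in INDEX_FILES:
        index_path = root / name
        if index_path.is_file():
            with index_path.open() as f:
                weight_map = json.load(f)["weight_map"]
            # Many tensors map to the same shard file, so deduplicate.
            return sorted(set(weight_map.values()))
    return []
```

Calling `list_shards` on the converted output directory and getting more than one file name back would confirm that the weights were still split.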

However, one issue remains: my training results are consistently poor, both when I validate on the dataset and when I use the model directly for inference. I set up training with 10 runs of 500 steps each, and I also tried adjusting the lora_r parameter and increasing the batch_size, but neither change helped. The model does not seem to be learning anything. Could you please help me identify where I might have made a mistake? Most of my settings are the default parameters. I look forward to your response!
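
One rough way to put a number on "not learning" is to compare the average training loss at the start and at the end of a run. The sketch below assumes the per-step losses have already been collected into a plain list. The function name `loss_improved`, the window size, and the 5% threshold are all illustrative assumptions, not values from this repository.

```python
from statistics import mean


def loss_improved(losses, window=50, min_drop=0.05):
    """Return True if the mean loss over the last `window` steps is at least
    `min_drop` (as a fraction) lower than the mean over the first `window` steps.

    Hypothetical helper: window and min_drop are arbitrary illustrative defaults.
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    start = mean(losses[:window])
    end = mean(losses[-window:])
    if start <= 0:
        raise ValueError("expected a positive starting loss")
    return (start - end) / start >= min_drop
```

If this returns False for a 500-step run, the loss curve is essentially flat, which points to a training-side problem (for example learning rate, data, or which parameters are trainable) and not to the validation or inference code.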

Apr 20 '25 00:04 xuewenMei