
wandb support and evaluation

Open · DanqingZ opened this issue on Apr 23, 2023 · 0 comments

Hello, I greatly appreciate the RLHF repository you have provided. Previously I was using trlx; after switching to this repository, my main concerns are experiment logging and evaluation.

  1. Specifically, wandb is not set up for this repository. I attempted to use https://github.com/hwchase17/langchain/issues/2918, but it only logged the training loss, evaluation loss, and evaluation perplexity, since no other evaluation metrics are set up. (A minimal sketch of the kind of logging I have in mind follows this list.)

  2. Additionally, in terms of evaluation, https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/evaluation_scripts provides some helpful scripts for conducting case studies, but there are no quantitative evaluation results. I am aware that scores such as ROUGE may not effectively measure the superiority of an RLHF-trained LLM, but having such numbers would still be beneficial. Alternatively, are there other methods to demonstrate that the step3 model is superior to the step1 model, beyond eyeballing the outputs? (One idea is sketched after the logging example below.)
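
For reference, this is roughly the logging I had in mind for point 1. It is only a sketch: `train_dataloader`, `compute_loss`, and `evaluate` are hypothetical stand-ins for the corresponding pieces of the step1 training loop, not the repository's actual API.

```python
# Minimal sketch of wandb metric logging in a generic training loop.
# `train_dataloader`, `compute_loss`, and `evaluate` are hypothetical
# placeholders, not the actual DeepSpeed-Chat training API.
import wandb

wandb.init(project="deepspeed-chat-step1", config={"lr": 1e-5, "epochs": 1})

for step, batch in enumerate(train_dataloader):
    loss = compute_loss(batch)  # forward/backward for one batch
    wandb.log({"train/loss": loss}, step=step)
    if step > 0 and step % 500 == 0:
        eval_loss, eval_ppl = evaluate()  # held-out evaluation
        wandb.log({"eval/loss": eval_loss, "eval/perplexity": eval_ppl}, step=step)

wandb.finish()
```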
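For point 2, one comparison I would find convincing is scoring both checkpoints' responses with the step2 reward model and comparing mean rewards. Below is a rough sketch, assuming the reward model can be loaded as a standard Hugging Face sequence-classification head (a simplification of the repo's actual reward model wrapper); all checkpoint paths and the toy prompt set are hypothetical.

```python
# Sketch: compare the step1 (SFT) and step3 (RLHF) checkpoints by scoring
# their responses with the step2 reward model. Paths are hypothetical, and
# loading the reward model as AutoModelForSequenceClassification is an
# assumption/simplification, not the repo's actual RewardModel class.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

prompts = ["Human: How do I bake bread?\nAssistant:"]  # toy evaluation set

def generate(model_path, prompts):
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    outs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        gen = model.generate(ids, max_new_tokens=128, do_sample=False)
        outs.append(tok.decode(gen[0], skip_special_tokens=True))
    return outs

def mean_reward(rm_path, texts):
    tok = AutoTokenizer.from_pretrained(rm_path)
    rm = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)
    scores = []
    with torch.no_grad():
        for t in texts:
            batch = tok(t, return_tensors="pt", truncation=True)
            scores.append(rm(**batch).logits[0, 0].item())  # scalar reward
    return sum(scores) / len(scores)

sft_out = generate("output/step1_sft", prompts)           # hypothetical path
rlhf_out = generate("output/step3_rlhf/actor", prompts)   # hypothetical path
print("SFT  mean reward:", mean_reward("output/step2_rm", sft_out))
print("RLHF mean reward:", mean_reward("output/step2_rm", rlhf_out))
```

Mean reward on a held-out prompt set is what step3's PPO optimizes, so a step3 model that does not beat step1 on this metric would be a red flag; a win rate judged by a stronger LLM or by humans would be a useful complementary metric.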

DanqingZ · Apr 23 '23 20:04