DeepSpeedExamples

Does this program support TensorBoard?

Open Chevolier opened this issue 2 years ago • 2 comments

Does this program support TensorBoard? I could not find any TensorBoard logs.

Chevolier avatar Apr 21 '23 10:04 Chevolier

@Chevolier, can you please clarify the program you are referring to? It would be helpful to share what you are running and the expected output. Thanks!

tjruwase avatar Apr 21 '23 17:04 tjruwase

I mean the 3rd step, applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/. In particular, I ran multi-node/run_66b.sh on 2 nodes with 16 GPUs in total, using bloomz-7b1 as the model. I can see the reward score in the standard output, but can I follow the training process in TensorBoard?

Chevolier avatar Apr 22 '23 00:04 Chevolier

Hi @Chevolier,

DeepSpeed has monitoring functionality built in, and the monitor backend (TensorBoard, WandB, or CSV) is selected by adding the corresponding section to the DeepSpeed configuration.

The documentation can be found here: https://www.deepspeed.ai/docs/config-json/#monitoring-module-tensorboard-wandb-csv

For TensorBoard, an example configuration may look like this:

"tensorboard": {
    "enabled": True,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
}

The configuration can be added to the get_train_ds_config utility function found here: https://github.com/microsoft/DeepSpeedExamples/blob/dafeb2b3be3a085214faa2f59a8979c051424938/applications/DeepSpeed-Chat/training/utils/ds_utils.py#L32

Models initialized with the resulting config will then have the monitor enabled. Please let me know if you run into any problems with this method.
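
As a rough sketch of what that change could look like, the snippet below merges the "tensorboard" section into an existing DeepSpeed config dict. The helper name with_tensorboard, the placeholder base config, and the job name "step3_rlhf" are made up for illustration and are not part of DeepSpeed or DeepSpeedExamples; in DeepSpeed-Chat the base dict would be whatever get_train_ds_config returns.

```python
import copy
import json


def with_tensorboard(ds_config, output_path="output/ds_logs/", job_name="train_bert"):
    """Return a copy of a DeepSpeed config dict with the TensorBoard monitor enabled.

    Hypothetical helper for illustration only; not a DeepSpeed API.
    """
    config = copy.deepcopy(ds_config)  # leave the caller's dict untouched
    config["tensorboard"] = {
        "enabled": True,
        "output_path": output_path,
        "job_name": job_name,
    }
    return config


# Placeholder base config (assumption); in DeepSpeed-Chat this would be
# the dict returned by get_train_ds_config(...).
base_config = {"train_batch_size": 32}
ds_config = with_tensorboard(base_config, job_name="step3_rlhf")

# When the config is written out as JSON, Python's True becomes lowercase true.
print(json.dumps(ds_config, indent=4))
```

Note that the example config above uses Python's True because it lives in a Python dict; in a JSON config file the value would be written as true. Once training is running, pointing TensorBoard at the configured output path (for example, tensorboard --logdir output/ds_logs/) should show whatever the monitor has written.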

Thanks, Lev

lekurile avatar Jun 09 '23 17:06 lekurile

Hi @Chevolier,

Just wanted to update you: we have a PR that adds various instrumentation across all the DS Chat steps, including TensorBoard logging (GH-624).

Feel free to give it a try to see if it works on your end.

Thanks, Lev

lekurile avatar Jul 13 '23 22:07 lekurile