Tian Lan

Results: 10 issues by Tian Lan

Hi, I am looking at the PPO implementation, and I am curious about this part (actually many other implementations are using this workflow as well, so I am also curious...
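(For context, a minimal sketch of the clipped surrogate objective that PPO implementations of this kind typically compute; the function name, tensor shapes, and the `clip_eps` default below are illustrative and not taken from the codebase in question.)

```python
import torch

def ppo_clipped_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Ratio between the current policy and the policy that generated the rollout.
    ratio = torch.exp(logprobs - old_logprobs)
    # Clipped surrogate objective; return the negative so it can be minimized.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```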

Hi, I am very interested in the distributed inference of Colossal AI. Since we have pre-trained NLP models from PyTorch or JAX, I wonder whether it is possible, or what should be...

My questions are mostly about stage 3. According to the doc https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/README.md, it says: ``` If you don't have step 1 and step 2 models. You may simply...

Hi, I am DPO-training a checkpoint of Mixtral-8x7B-Instruct from a previous supervised finetune. I mainly followed this script https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py with 8 H100 GPUs, flash attention, and DeepSpeed ZeRO-2,...
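(For reference, a minimal sketch of the DPO objective that the linked `dpo_llama2.py` script optimizes through TRL's `DPOTrainer`; the function and argument names are illustrative, and `beta=0.1` is an assumed default rather than the value used in the run above.)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trained policy vs. the frozen reference (SFT) model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the margin between chosen and rejected log-ratios to be positive.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```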

I ran into a few errors when running `train_with_warp_drive`. One is that `from scripts.run_unittests import import_class_from_path` does not seem right: it complains that `scripts` cannot be found as a module, so I removed `scripts.`,...

Hello, I looked at your bA3C code and learned a lot. Thank you so much for sharing your codebase implementing this idea. I have two questions, mainly about your code....

It looks like FSDP is a pretty awesome module for distributing the base model, but does this codebase support LoRA fine-tuning? I think usually what we would like DPO...
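(A minimal sketch of attaching LoRA adapters with PEFT before the model is handed to FSDP; the checkpoint name and `target_modules` are assumptions, and whether the FSDP wrapping in this codebase accepts a PEFT-wrapped model is exactly the open question above.)

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint name; substitute the base model actually being tuned.
base = AutoModelForCausalLM.from_pretrained("my-org/my-sft-checkpoint")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# The PEFT-wrapped model would then be sharded by FSDP / prepared by the trainer.
```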

I am wondering, for multi-node FSDP, do `local_rank` and `rank` have any obvious difference here? I think I understand that `local_rank` is the rank within a node. I see in...
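(A minimal sketch of the usual convention under `torchrun`: `rank` is the global index across all nodes, while `local_rank` indexes processes within one node and is what selects the GPU; the environment-variable usage below is standard `torch.distributed`, not specific to this repo.)

```python
import os
import torch
import torch.distributed as dist

# torchrun sets these for every process it launches.
rank = int(os.environ["RANK"])              # global index across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # index within the current node
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)          # one GPU per process on this node
dist.init_process_group(backend="nccl")    # rank/world size are read from env

if rank == 0:
    # Exactly one process in the whole job has global rank 0;
    # every node has a process with local_rank 0.
    print(f"running with world_size={world_size}")
```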

Hello, when I use Accelerate and DeepSpeed ZeRO-3 to train the model on one node with 8 GPUs, the following code smoothly saves the model checkpoint: ``` ds_state_dict = model._zero3_consolidated_16bit_state_dict()...
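(A minimal sketch of how that save is usually guarded across ranks; `_zero3_consolidated_16bit_state_dict()` is the same internal DeepSpeed method as in the snippet above and is a collective call, while the rank-0 guard and output path below are illustrative.)

```python
import torch
import torch.distributed as dist

# Every rank must enter this call so ZeRO-3 can gather the sharded
# parameters; only rank 0 ends up with the full 16-bit state dict.
ds_state_dict = model._zero3_consolidated_16bit_state_dict()

if dist.get_rank() == 0:
    torch.save(ds_state_dict, "consolidated_checkpoint.pt")  # illustrative path
dist.barrier()  # keep other ranks from racing ahead while rank 0 writes
```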
