Eike Steffen Kohlmeyer
Hey @jeffra 🙂, is this not implemented only for the hybrid engine, i.e. would it work as expected if I disable Hybrid Engine and remove the assertion from the `main.py`...
Adding the `strict=False` parameter in line 70 of `utils.model.model_utils.create_critic_model` lets me load the checkpoint: ``` def create_critic_model(model_name_or_path, tokenizer, ds_config, num_padding_at_beginning=0, rlhf_training=False): # OPT model family always put a padding...
Hey @XiaoLaoDi not yet, but here is what I tried so far: - Use a **machine with more VRAM** that should definitely be able to fit the model, to rule...
@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas
@ruihan0495 thank you for the info. Not using DeepSpeed-HE does indeed make training possible 🙂. I run into another exception a little later in the code, but that...
@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in...
Hey @MAJIN123, what are the actor / critic model architectures?
Hi @swang99, Did you also not set `deepspeed_enable` to True in the config.yaml? Because this has to be done. If so, can you please share the config.yaml as well as...
@swang99 was your issue resolved with the latest repo state? If not, could you please share your nvidia-smi output? You could also try ZeRO optimization stage 3 (https://www.deepspeed.ai/tutorials/zero/#zero-overview)
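For reference, a minimal sketch of what enabling ZeRO stage 3 could look like in a DeepSpeed config. This assumes the config is handed to DeepSpeed as a JSON file or dict; how it gets wired in depends on the repo, and the batch size below is a placeholder rather than a value from this thread.

```python
import json

# Minimal DeepSpeed config sketch with ZeRO stage 3 enabled.
# The batch size is a placeholder, not a value taken from this thread.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {
        # Stage 3 partitions optimizer states, gradients, and parameters
        # across the participating GPUs.
        "stage": 3,
    },
}


def to_json(config: dict) -> str:
    """Serialize the config so it can be written to a JSON config file."""
    return json.dumps(config, indent=2)


print(to_json(ds_config))
```

Stage 3 trades extra communication for lower per-GPU memory use, so it is mainly worth trying when the nvidia-smi output shows the GPUs running out of memory.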