Eike Steffen Kohlmeyer comments

Results: 9 comments of Eike Steffen Kohlmeyer

how to use zero-offload?

Hey @jeffra 🙂, is this not implemented only for the Hybrid Engine, i.e. would it work as expected if I disable the Hybrid Engine and remove the assertion from `main.py`...

Missing key(s) in state_dict for bias in attention blocks

Adding the `strict=False` parameter in line 70 of `utils.model.model_utils.create_critic_model` lets me load the checkpoint: ``` def create_critic_model(model_name_or_path, tokenizer, ds_config, num_padding_at_beginning=0, rlhf_training=False): # OPT model family always put a padding...
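For context, here is a minimal sketch of what `strict=False` changes when loading a state dict in PyTorch. The `TinyAttention` module and its `q_proj` layer are hypothetical stand-ins for the attention blocks mentioned in the issue title, not the actual critic model code.

```python
import torch
from torch import nn


class TinyAttention(nn.Module):
    """Hypothetical stand-in for an attention block with an optional bias."""

    def __init__(self, bias: bool):
        super().__init__()
        self.q_proj = nn.Linear(4, 4, bias=bias)


# Checkpoint saved from a model WITHOUT bias terms.
checkpoint = TinyAttention(bias=False).state_dict()

# Target model expects a bias, so the checkpoint is missing `q_proj.bias`.
model = TinyAttention(bias=True)

# strict=True (the default) would raise a RuntimeError about missing keys.
# strict=False loads what matches and reports the rest instead of failing.
result = model.load_state_dict(checkpoint, strict=False)

print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```

Note that any key reported as missing keeps its freshly initialized value, so it is worth checking `missing_keys` rather than ignoring it.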

Step 3: RuntimeError: CUDA error: misaligned address

Hey @XiaoLaoDi not yet, but here is what I tried so far: - Use a **machine with more VRAM** that should definitely be able to fit the model, to rule...

Step 3: RuntimeError: CUDA error: misaligned address

@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas.

Step 3: RuntimeError: CUDA error: misaligned address

@ruihan0495 thank you for the info. Not using the DeepSpeed-HE does indeed make training possible 🙂. I run into another exception a little later in the code, but that...

Step 3: RuntimeError: CUDA error: misaligned address

@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in...

step 3 : OOM

Hey @MAJIN123, what are the actor / critic model architectures?

Issues with accelerate and deepspeed training

Hi @swang99, did you perhaps not set `deepspeed_enable` to True in the config.yaml? This has to be done. If so, can you please share the config.yaml as well as...

Issues with accelerate and deepspeed training

@swang99 Was your issue resolved with the latest repo state? If not, could you please share your nvidia-smi output? You could also try ZeRO optimization stage 3 (https://www.deepspeed.ai/tutorials/zero/#zero-overview).
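As a rough illustration of the ZeRO stage 3 suggestion, a DeepSpeed config could look something like the sketch below. The batch size is a placeholder value, and the CPU optimizer offload is an optional assumption that is not part of the original suggestion; both should be adapted to the actual setup.

```python
import json

# Minimal DeepSpeed config sketch enabling ZeRO stage 3.
ds_config = {
    # Placeholder value (assumption); set this to match your training script.
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        # Stage 3 partitions optimizer states, gradients, and parameters.
        "stage": 3,
        # Optional (assumption): offload optimizer states to CPU memory
        # to reduce GPU memory pressure further.
        "offload_optimizer": {"device": "cpu"},
    },
}

# Serialize to JSON, the format DeepSpeed config files use.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```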