
DPO rewards stuck at zero

pankayaraj opened this issue 1 year ago • 11 comments

While fine-tuning Llama from an SFT model trained with a LoRA config, I get this kind of behavior where both rewards stay at 0 and the loss never goes down:

15 {'loss': 0.6932, 'learning_rate': 1.3724429223744293e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.45967864990234, 'logps/chosen': -76.77323150634766, 'logits/rejected': -1.6269360780715942, 'logits/chosen': -1.6115689277648926, 'epoch': 0.38}
16 {'loss': 0.6932, 'learning_rate': 1.3670776255707763e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -96.17804718017578, 'logps/chosen': -74.35616302490234, 'logits/rejected': -1.6258790493011475, 'logits/chosen': -1.5984680652618408, 'epoch': 0.41}
17 {'loss': 0.6932, 'learning_rate': 1.3617123287671234e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -98.59229278564453, 'logps/chosen': -73.85718536376953, 'logits/rejected': -1.6299934387207031, 'logits/chosen': -1.606400489807129, 'epoch': 0.45}
18 {'loss': 0.6932, 'learning_rate': 1.3563470319634702e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.72036743164062, 'logps/chosen': -79.21975708007812, 'logits/rejected': -1.6140862703323364, 'logits/chosen': -1.5989283323287964, 'epoch': 0.49}

I used the following training arguments.

I tried both fp16 and bf16.

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    remove_unused_columns=False,
    num_train_epochs=epochs,
    output_dir=save_dir,
    save_steps=1500,
    logging_first_step=True,
    logging_steps=5,
    learning_rate=1.41e-5,
    optim="rmsprop",
    warmup_steps=0,
    # bf16=True,
    # fp16=True,
)
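For reference, a minimal sketch of how arguments like these typically get wired into DPOTrainer with a LoRA adapter, assuming the trl 0.7.x-era API; the checkpoint path, dataset file, and LoRA settings below are placeholders, not the exact setup:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# Placeholder checkpoint: substitute the actual SFT model
model_name = "path/to/llama-sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

# Placeholder data: assumed to already have "prompt", "chosen" and "rejected" columns
train_dataset = load_dataset("json", data_files="preferences.json", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,
    ref_model=None,      # with a PEFT model, trl uses the base weights (adapters disabled) as the reference
    args=training_args,  # the TrainingArguments instance defined above
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()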

pankayaraj avatar Feb 02 '24 03:02 pankayaraj

Hmm, that's a bit weird, cc @kashif in case you have any idea. Not sure if this is a duplicate of https://github.com/huggingface/trl/issues/1236 - might be a hyperparameter issue? (That issue, though, is about the loss, not the rewards.)

younesbelkada avatar Feb 02 '24 07:02 younesbelkada

@pankayaraj Did you find a solution? I have the same bug.

Mine is at loss=0.6931 instead of 0.6932, and logps/rejected goes down to crazy numbers near the end ('logps/rejected': -36880.9296875). I'm using Mistral 7B.

AlexiaJM avatar Feb 12 '24 19:02 AlexiaJM

I solved the issue on my side by installing the dev build of trl from GitHub and the latest pip version of datasets.

AlexiaJM avatar Feb 14 '24 21:02 AlexiaJM

Great! Let me know how it goes!

kashif avatar Feb 14 '24 22:02 kashif

@pankayaraj Did you find a solution? I have the same bug. My DPO training loss is stuck at 0.6931.

0nutation avatar Feb 29 '24 05:02 0nutation

@AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

0nutation avatar Feb 29 '24 05:02 0nutation

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

Hi, were you able to solve it? I am pretty sure we are making some mistake, because on my first run I got good results.

jayachandrakalakutagar avatar Mar 01 '24 02:03 jayachandrakalakutagar

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

> Hi, were you able to solve it? I am pretty sure we are making some mistake, because on my first run I got good results.

I haven't solved it yet. What do you mean by "on my first run I got good results"? Why does that indicate a mistake? Where do you think the mistake might come from?

0nutation avatar Mar 01 '24 04:03 0nutation

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

> Hi, were you able to solve it? I am pretty sure we are making some mistake, because on my first run I got good results.

Did you solve it?

0nutation avatar Mar 01 '24 06:03 0nutation

I had the same issue, getting the exact same loss as the OP (0.6932) for a couple of epochs. I resolved it by upgrading trl from 0.7.4 to 0.7.11 and also lowering the learning rate.

fc2869 avatar Mar 11 '24 15:03 fc2869

Hi, I was working on reward modeling with Mistral over the past few weeks and encountered the same issue. The problem in my case was that the standard chat template prevents the model from handling multi-turn chat if you set the pad token to the EOS token. Instead, the following modification solves the problem.

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
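For context, a minimal sketch of where those two lines fit when loading the model and tokenizer (the Mistral checkpoint name is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add a dedicated [PAD] token instead of reusing the EOS token for padding,
# then resize the embedding matrix so the new token has an entry.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))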

Hope this helps!

WeiXiongUST avatar Mar 23 '24 22:03 WeiXiongUST

Maybe you should check whether your ref_model is a static reference copy of model. There is the following code snippet in DPOTrainer's __init__():

        if ref_model:
            self.ref_model = ref_model
        elif self.is_peft_model or precompute_ref_log_probs:
            # The `model` with adapters turned off will be used as the reference model
            self.ref_model = None
        else:
            self.ref_model = create_reference_model(model) 

I fixed this problem by following what create_reference_model() does.


def create_reference_model(
    model: PreTrainedModelWrapper, num_shared_layers: int = None, pattern: str = None
) -> PreTrainedModelWrapper:
    """
    Creates a static reference copy of a model. Note that model will be in `.eval()` mode.

    Args:
        model (`PreTrainedModelWrapper`): The model to be copied.
        num_shared_layers (`int`, *optional*): The number of initial layers that are shared between both models and kept frozen.
        pattern (`str`, *optional*): The shared layers are selected with a string pattern
            (e.g. "transformer.h.{layer}" for GPT2) and if a custom pattern is necessary it can be passed here.

    Returns
        `PreTrainedModelWrapper`
    """
    if is_deepspeed_zero3_enabled():
        raise ValueError(
            "DeepSpeed ZeRO-3 is enabled and is not compatible with `create_reference_model()`. Please instantiate your reference model directly with `AutoCausalLM.from_pretrained()`."
        )

    parameter_names = [n for n, _ in model.named_parameters()]
    ref_model = deepcopy(model)

    # if no layers are shared, return copy of model
    if num_shared_layers is None:
        for param_name in parameter_names:
            param = ref_model.get_parameter(param_name)
            param.requires_grad = False
        return ref_model.eval()

    # identify layer name pattern
    if pattern is not None:
        pattern = pattern.format(layer=num_shared_layers)
    else:
        for pattern_candidate in LAYER_PATTERNS:
            pattern_candidate = pattern_candidate.format(layer=num_shared_layers)
            if any([pattern_candidate in name for name in parameter_names]):
                pattern = pattern_candidate
                break

    if pattern is None:
        raise ValueError("Layer pattern could not be matched.")

    # divide parameters in shared and unshared parameter lists
    shared_param_list = []
    unshared_param_list = []

    shared_parameter = True
    for name, param in model.named_parameters():
        if pattern in name:
            shared_parameter = False
        if shared_parameter:
            shared_param_list.append(name)
        else:
            unshared_param_list.append(name)

    # create reference of the original parameter if they are shared
    for param_name in shared_param_list:
        param = model.get_parameter(param_name)
        param.requires_grad = False

        ref_param = ref_model.get_parameter(param_name)  # noqa
        ref_param = param  # noqa

    # for all other parameters just make sure they don't use gradients
    for param_name in unshared_param_list:
        param = ref_model.get_parameter(param_name)
        param.requires_grad = False

    if pattern is not None and len(unshared_param_list) == 0:
        logging.warning("Pattern passed or found, but no layers matched in the model. Check for a typo.")

    return ref_model.eval()

FangHainannn avatar Apr 12 '24 12:04 FangHainannn

From the discussions above and internally, it seems this could be solved by tweaking hyperparameters. Can you try playing a bit with the learning rate and let us know how it goes?
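For what it's worth, DPO runs are often far more sensitive to the learning rate than SFT. A hedged example of the kind of change to try against the arguments posted in the first comment (reusing its epochs and save_dir variables; the value is only a starting point, not an official recommendation):

from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    remove_unused_columns=False,
    num_train_epochs=epochs,
    output_dir=save_dir,
    save_steps=1500,
    logging_first_step=True,
    logging_steps=5,
    learning_rate=5e-7,  # try values well below the original 1.41e-5, e.g. 5e-7 to 5e-6
    optim="rmsprop",
    warmup_steps=100,    # a short warmup can also help stabilize the first steps
)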

younesbelkada avatar May 23 '24 09:05 younesbelkada

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Jun 17 '24 15:06 github-actions[bot]

> I solved the issue on my side by installing the dev build of trl from GitHub and the latest pip version of datasets.

Thanks! It also works on my machine :)

pppa2019 avatar Jul 31 '24 13:07 pppa2019

@FangHainannn's solution works. Using create_reference_model(), copy.deepcopy(), or simply leaving ref_model unset in DPOTrainer (if it is None, the library creates a reference copy of the model for you) all work.

The mistake in my case was that I wrongly passed model as the ref_model argument, so model and ref_model pointed to the same object.
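To make that failure mode concrete, a rough sketch (trl 0.7.x-era DPOTrainer arguments, abbreviated; training_args, train_dataset and tokenizer are placeholders) of the three setups that work versus the one that pins the rewards to zero:

import copy
from trl import DPOTrainer, create_reference_model

# Works: leave ref_model unset and let trl build the frozen reference copy itself
trainer = DPOTrainer(model, ref_model=None, args=training_args,
                     train_dataset=train_dataset, tokenizer=tokenizer)

# Works: build the frozen copy explicitly
trainer = DPOTrainer(model, ref_model=create_reference_model(model), args=training_args,
                     train_dataset=train_dataset, tokenizer=tokenizer)

# Works: a plain deep copy also gives an independent set of parameters
trainer = DPOTrainer(model, ref_model=copy.deepcopy(model), args=training_args,
                     train_dataset=train_dataset, tokenizer=tokenizer)

# Broken (the mistake described above): model and ref_model are the same object,
# so policy and reference log-probs are identical, the implicit rewards are exactly 0,
# and the loss sits at log(2) ≈ 0.6931.
trainer = DPOTrainer(model, ref_model=model, args=training_args,
                     train_dataset=train_dataset, tokenizer=tokenizer)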

lixiaochuan2020 avatar Aug 02 '24 04:08 lixiaochuan2020

@0nutation Did you solve this?

kencyshaka avatar Sep 05 '24 13:09 kencyshaka