DPO rewards stuck at zero
While fine-tuning Llama with DPO from an SFT model trained with a LoRA config, I get this behavior where both rewards stay at 0 and the loss never goes down:
15 {'loss': 0.6932, 'learning_rate': 1.3724429223744293e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.45967864990234, 'logps/chosen': -76.77323150634766, 'logits/rejected': -1.6269360780715942, 'logits/chosen': -1.6115689277648926, 'epoch': 0.38}
16 {'loss': 0.6932, 'learning_rate': 1.3670776255707763e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -96.17804718017578, 'logps/chosen': -74.35616302490234, 'logits/rejected': -1.6258790493011475, 'logits/chosen': -1.5984680652618408, 'epoch': 0.41}
17 {'loss': 0.6932, 'learning_rate': 1.3617123287671234e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -98.59229278564453, 'logps/chosen': -73.85718536376953, 'logits/rejected': -1.6299934387207031, 'logits/chosen': -1.606400489807129, 'epoch': 0.45}
18 {'loss': 0.6932, 'learning_rate': 1.3563470319634702e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.72036743164062, 'logps/chosen': -79.21975708007812, 'logits/rejected': -1.6140862703323364, 'logits/chosen': -1.5989283323287964, 'epoch': 0.49}
I used the following training arguments. I tried both fp16 and bf16.
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    remove_unused_columns=False,
    num_train_epochs=epochs,
    output_dir=save_dir,
    save_steps=1500,
    logging_first_step=True,
    logging_steps=5,
    learning_rate=1.41e-5,
    optim="rmsprop",
    warmup_steps=0,
    # bf16=True,
    # fp16=True,
)
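For reference, here is a minimal sketch of how these arguments are typically wired into DPOTrainer with a LoRA setup, assuming a trl 0.7.x-style API; the model path, dataset file, and LoRA hyper-parameters are placeholders, not the exact setup from this issue.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# Placeholder SFT checkpoint and preference dataset, used only for illustration.
model = AutoModelForCausalLM.from_pretrained("path/to/sft-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")

# The dataset must contain "prompt", "chosen" and "rejected" columns.
train_dataset = load_dataset("json", data_files="preference_pairs.json", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model,
    ref_model=None,            # with a PEFT model, leave this as None (see discussion below)
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()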
Hmm, that's a bit weird, cc @kashif if you have any idea. Not sure if this is a duplicate of https://github.com/huggingface/trl/issues/1236 - it might be a hyperparameter issue? (That issue, though, is about the loss, not the rewards.)
@pankayaraj Did you find a solution? I have the same bug.
Mine is stuck at loss=0.6931 instead of 0.6932, and logps/rejected drops to extreme values near the end ('logps/rejected': -36880.9296875). I'm using Mistral 7B.
I solved the issue on my side, by installing the dev-build from github of 'trl' and the latest pip version of 'datasets'.
Great! Let me know how it goes!
@pankayaraj Did you find a solution? I have the same bug. My DPO training loss is stuck at 0.6931.
@AlexiaJM I have the same bug. But installing the dev-build from github of 'trl' and the latest pip version of 'datasets' doesn't work for me.
Hi, were you able to solve it? I'm pretty sure we are making some mistake, because I got good results on my first run.
I haven't solved it yet. What do you mean by "I got good results on my first run"? Why does that indicate a mistake, and where do you think the mistake might come from?
Did you solve it?
I had the same issue, getting exactly the same loss as the OP (0.6932) for a couple of epochs. I resolved it by upgrading trl from 0.7.4 to 0.7.11 and also lowering the learning rate.
Hi, I have been working on reward modeling with Mistral over the past few weeks and encountered the same issue. The problem in my case was that the standard chat template prevents the model from handling multi-turn chat if you set the pad token to the eos token. Instead, the following modification solved the problem.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))
Hope this helps!
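For anyone who wants to try this, a minimal sketch of where the fix goes; the checkpoint name is a placeholder. It adds a dedicated [PAD] token instead of reusing the eos token and then resizes the embedding matrix before building the trainer.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, used only for illustration.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use a dedicated padding token instead of reusing the eos token,
# then resize the embeddings to account for the new token.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))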
Maybe you should check whether your ref_model is a static reference copy of model. There is the following code snippet in DPOTrainer's __init__():
if ref_model:
    self.ref_model = ref_model
elif self.is_peft_model or precompute_ref_log_probs:
    # The `model` with adapters turned off will be used as the reference model
    self.ref_model = None
else:
    self.ref_model = create_reference_model(model)
I fixed this problem by following what create_reference_model() does.
def create_reference_model(
    model: PreTrainedModelWrapper, num_shared_layers: int = None, pattern: str = None
) -> PreTrainedModelWrapper:
    """
    Creates a static reference copy of a model. Note that model will be in `.eval()` mode.

    Args:
        model (`PreTrainedModelWrapper`): The model to be copied.
        num_shared_layers (`int`, *optional*): The number of initial layers that are shared between both models and kept frozen.
        pattern (`str`, *optional*): The shared layers are selected with a string pattern
            (e.g. "transformer.h.{layer}" for GPT2) and if a custom pattern is necessary it can be passed here.

    Returns
        `PreTrainedModelWrapper`
    """
    if is_deepspeed_zero3_enabled():
        raise ValueError(
            "DeepSpeed ZeRO-3 is enabled and is not compatible with `create_reference_model()`. Please instantiate your reference model directly with `AutoCausalLM.from_pretrained()`."
        )

    parameter_names = [n for n, _ in model.named_parameters()]
    ref_model = deepcopy(model)

    # if no layers are shared, return copy of model
    if num_shared_layers is None:
        for param_name in parameter_names:
            param = ref_model.get_parameter(param_name)
            param.requires_grad = False
        return ref_model.eval()

    # identify layer name pattern
    if pattern is not None:
        pattern = pattern.format(layer=num_shared_layers)
    else:
        for pattern_candidate in LAYER_PATTERNS:
            pattern_candidate = pattern_candidate.format(layer=num_shared_layers)
            if any([pattern_candidate in name for name in parameter_names]):
                pattern = pattern_candidate
                break

    if pattern is None:
        raise ValueError("Layer pattern could not be matched.")

    # divide parameters in shared and unshared parameter lists
    shared_param_list = []
    unshared_param_list = []

    shared_parameter = True
    for name, param in model.named_parameters():
        if pattern in name:
            shared_parameter = False
        if shared_parameter:
            shared_param_list.append(name)
        else:
            unshared_param_list.append(name)

    # create reference of the original parameter if they are shared
    for param_name in shared_param_list:
        param = model.get_parameter(param_name)
        param.requires_grad = False

        ref_param = ref_model.get_parameter(param_name)  # noqa
        ref_param = param  # noqa

    # for all other parameters just make sure they don't use gradients
    for param_name in unshared_param_list:
        param = ref_model.get_parameter(param_name)
        param.requires_grad = False

    if pattern is not None and len(unshared_param_list) == 0:
        logging.warning("Pattern passed or found, but no layers matched in the model. Check for a typo.")

    return ref_model.eval()
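Concretely, one way to apply this is to build the reference model explicitly before constructing the trainer. This is only a sketch assuming an existing model, training_args, train_dataset, and tokenizer from the snippets above:

from copy import deepcopy

from trl import DPOTrainer, create_reference_model

# Either let trl build the frozen copy for you ...
ref_model = create_reference_model(model)
# ... or make a plain deep copy yourself; both give a static snapshot of `model`.
# ref_model = deepcopy(model)

trainer = DPOTrainer(
    model,
    ref_model,                 # must NOT be the same object as `model`
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)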
From the discussions above and internally, this could be solved by tweaking hyper-parameters. Can you try to play a bit with the learning rate and let us know how it goes?
Thanks! Installing the trl dev build and the latest datasets also works on my machine :)
@FangHainannn's solution works. Whether you use create_reference_model(), copy.deepcopy(), or simply leave ref_model in DPOTrainer as None (in which case Hugging Face creates a copy of the initial model for you), all of them work.
My mistake was that I wrongly passed model itself as the ref_model argument, so 'model' and 'ref_model' pointed to the same object.
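To make the failure mode concrete: when model and ref_model are the same object, the policy and reference log-probs are always identical, so the implicit DPO reward beta * (logp_policy - logp_ref) is exactly 0 and the loss sits at log(2) ≈ 0.693, which matches the logs above. A sketch of the wrong and the right wiring, reusing the placeholder variables from the earlier snippets:

from trl import DPOTrainer

# Wrong: the policy and the reference are literally the same object,
# so their log-probs can never diverge and the rewards stay at 0.
# trainer = DPOTrainer(model, ref_model=model, args=training_args, beta=0.1, train_dataset=train_dataset, tokenizer=tokenizer)

# Right: let the trainer create (or be given) an independent frozen reference.
trainer = DPOTrainer(
    model,
    ref_model=None,            # non-PEFT: trl deep-copies `model`; PEFT: adapters are disabled for the reference pass
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
assert trainer.model is not trainer.ref_model  # sanity check that they are distinct objects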
@0nutation Did you solve this?