Has anyone faced the problem of the DPO rewards/accuracies getting stuck at 0.5 and the loss stuck between 0.6 and 0.8?

Open Thewillman opened this issue 1 year ago • 5 comments

I just restructured my dataset: I grouped the <prompt, chosen, rejected> triplets whose prompts are similar into lists (I want to add a new loss function, but the problem already happens with the plain DPO loss), and I changed the `_tokenize` function and the related padding functions to read the data structure `List[List[prompt, chosen, rejected]]`. Could having similar prompts in the same batch affect the training process? When I don't put my data into lists and just use the official functions to read each example, the loss decreases sharply within a few iterations.

By a similar prompt I mean, for instance, the same passage with only the position of some sentences changed.
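
For illustration, a minimal sketch of the two dataset layouts being compared here (the `prompt_list`/`chosen_list`/`rejected_list` keys match the modified `_tokenize` posted below; the concrete strings are invented placeholders):

```python
# Standard TRL DPO format: one independent triplet per row.
standard_rows = [
    {"prompt": "Summarize passage A.", "chosen": "good summary of A", "rejected": "bad summary of A"},
    {"prompt": "Summarize passage A with its sentences reordered.", "chosen": "good summary of A'", "rejected": "bad summary of A'"},
]

# Grouped format described in this issue: triplets with similar prompts are
# packed into one row as parallel lists, so they land in the same batch.
grouped_rows = [
    {
        "prompt_list": ["Summarize passage A.", "Summarize passage A with its sentences reordered."],
        "chosen_list": ["good summary of A", "good summary of A'"],
        "rejected_list": ["bad summary of A", "bad summary of A'"],
    },
]
```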

Thewillman avatar Oct 07 '24 17:10 Thewillman

The rewards/accuracies metric just floats around 0.5, which means that in some steps the chosen rewards are smaller than the rejected rewards.
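
For context, this is roughly how the rewards/accuracies metric arises in DPO (a simplified sketch of the implicit-reward computation, not the exact TRL code): an accuracy stuck at 0.5 means the chosen response gets a higher implicit reward than the rejected one only about half of the time.

```python
import torch

def dpo_reward_stats(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     beta: float = 0.1):
    # Implicit DPO rewards: beta-scaled log-prob ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # rewards/accuracies: fraction of pairs where chosen beats rejected.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    # rewards/margins: average gap between chosen and rejected rewards.
    margin = (chosen_rewards - rejected_rewards).mean()
    return accuracy, margin
```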

Thewillman avatar Oct 07 '24 17:10 Thewillman

Sorry, but despite my best efforts I can't understand your question. You're talking about similar prompts in a list, about modifying the codebase without providing us with your modifications, about a new loss function, and about values that stagnate, and so on.

Can you try to put things more clearly, providing all the necessary information and only the necessary information? Like the code you're using, the dataset, the package version you're using, the training arguments, etc. In other words, what's needed to easily replicate what you're describing? See other issues for some references...
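
A minimal reproduction for a DPO issue would look something like the sketch below (assuming a TRL version from around the time of this issue, ~0.11, where `DPOTrainer` still takes a `tokenizer` argument; the model name and the tiny dataset are placeholders):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tiny placeholder preference dataset in the standard prompt/chosen/rejected format.
train_dataset = Dataset.from_dict({
    "prompt": ["What color is the sky?", "What is 2 + 2?"],
    "chosen": ["The sky is blue.", "2 + 2 = 4."],
    "rejected": ["The sky is green.", "2 + 2 = 5."],
})

training_args = DPOConfig(output_dir="dpo-repro", per_device_train_batch_size=2, logging_steps=1)
trainer = DPOTrainer(model, args=training_args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```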

qgallouedec avatar Oct 07 '24 17:10 qgallouedec

Thank you for looking into this. I'll try to explain in as much detail as possible. I built a [prompt, chosen, rejected] dataset as required, and then, in order to add a new loss, I grouped the examples with similar prompts into several lists and let the dataloader read them. However, the training results were very poor. After removing the new loss function, I found that the DPO loss itself was already performing poorly. I'm not sure whether this is caused by the change in how the data is loaded, because after this change each batch contains several examples with similar prompts. Here is the `_tokenize` function I revised in dpo_trainer.py; maybe it helps to understand my question:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

from transformers import PreTrainedModel, PreTrainedTokenizerBase
from trl import DPOConfig

# Modified _tokenize from trl/trainer/dpo_trainer.py; helpers such as
# _process_prompt, _process_answer, _build_sequence_tokens, etc. are the
# unmodified functions that live in the same module.
def _tokenize(
    features: Dict[str, List],
    tokenizer: PreTrainedTokenizerBase,
    args: DPOConfig,
    processor: Optional[Callable] = None,
    model: Optional[PreTrainedModel] = None,
) -> Dict[str, List]:
    batch = defaultdict(list)
    # Each feature row now holds a *list* of similar prompts/answers instead of
    # a single prompt/chosen/rejected triple.
    prompt_list = features["prompt_list"]
    chosen_list = features["chosen_list"]
    rejected_list = features["rejected_list"]
    if model is None:
        chosen_tokens_list = []
        rejected_tokens_list = []
        prompt_tokens_list = []
        for idx in range(len(prompt_list)):
            prompt = prompt_list[idx]
            chosen = chosen_list[idx]
            rejected = rejected_list[idx]
            images = [None] * len(prompt)
            # Tokenize the prompt and both answers with the original TRL helpers.
            prompt_tokens = _process_prompt(prompt, processor, tokenizer, images)
            chosen_tokens = _process_answer(prompt, chosen, processor, tokenizer, images)
            rejected_tokens = _process_answer(prompt, rejected, processor, tokenizer, images)
            prompt_len_input_ids = _adjust_prompt_length(prompt_tokens, chosen_tokens, rejected_tokens)
            prompt_tokens, chosen_tokens, rejected_tokens = _add_special_tokens(
                tokenizer, prompt_len_input_ids, prompt_tokens, chosen_tokens, rejected_tokens
            )
            _truncate_tokens(chosen_tokens, rejected_tokens, prompt_tokens, args)
            prompt_tokens_list.append(prompt_tokens)
            chosen_tokens_list.append(chosen_tokens)
            rejected_tokens_list.append(rejected_tokens)
        _build_sequence_tokens(batch, chosen_tokens_list, args, "chosen")
        _build_sequence_tokens(batch, rejected_tokens_list, args, "rejected")
        _append_prompt_tokens_to_batch(batch, prompt_tokens_list)
    else:
        # Encoder-decoder models: delegate to the original helper per triple.
        for idx in range(len(prompt_list)):
            prompt_list_ = prompt_list[idx]
            chosen_list_ = chosen_list[idx]
            rejected_list_ = rejected_list[idx]
            _tokenize_encoder_decoder(
                batch, tokenizer, prompt_list_, chosen_list_, rejected_list_, args, model
            )
    return dict(batch)
```

Thewillman avatar Oct 07 '24 17:10 Thewillman

I checked the code in my customized dpo_trainer.py and utils.py, and the padding and data-loading code does not seem to be wrong. Here are the TensorBoard results (accuracy, loss, and margins plots).

Thewillman avatar Oct 07 '24 17:10 Thewillman

This could go up on r/programmerhorror

Sir, I too struggled to understand the problem. From what I gathered, I have to ask: Why would you group similar prompts together? There's probably no benefit to grouping them. If anything, you want them to be as diverse and random as possible to improve generalization.

Since I don't understand what you're doing with the tokenizer, it's likely that you're overcomplicating things and probably messing up the data structure. You may need to take a step back and approach it in a simpler way.
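
For comparison, the conventional setup is just the standard row-per-triplet dataset, shuffled so that similar prompts rarely share a batch (a minimal sketch with placeholder data):

```python
from datasets import Dataset

# Standard DPO preference data: one independent triplet per row (placeholder examples).
dataset = Dataset.from_dict({
    "prompt": ["Summarize passage A.", "Summarize passage B.", "Summarize passage C."],
    "chosen": ["good summary of A", "good summary of B", "good summary of C"],
    "rejected": ["bad summary of A", "bad summary of B", "bad summary of C"],
})

# Shuffle so that similar prompts are unlikely to end up in the same batch.
dataset = dataset.shuffle(seed=42)
```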

August-murr avatar Oct 08 '24 18:10 August-murr