
[FEATURE REQUEST] Allow masking input loss when packing=True

Open · Peter-Devine opened this issue 1 year ago · 2 comments

Why

  • Chatbot training is probably the most common single use-case for TRL.
  • As such, it would be nice to give developers the option to mask the loss on user inputs so that a model can fit its outputs more quickly.
  • Obviously there are cases where masking inputs makes little difference, but for tasks such as closed QA, which have long user inputs and relatively short LLM outputs, I can imagine this being beneficial.
  • This functionality is already enabled via DataCollatorForCompletionOnlyLM, but that collator does not support packing.
  • Therefore, it would be nice to be able to do both packing and masking at the same time.

How

  • Use the functions I outline here to mask the labels of each conversation.
  • In a class like ConstantLengthDataset, instead of tokenizing a piece of text from a defined column here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L460, apply the tokenize_messages(messages, tokenizer) function to each row's conversation.
  • This yields input ids and labels, which should then both be treated the same way. I.e., initialize an all_labels = [] alongside the all_token_ids = [] here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L463, and append the computed labels to it in the same way as all_token_ids.
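The steps above could be sketched roughly as follows. Note this is a hypothetical illustration, not TRL's actual API: tokenize_messages, pack, and IGNORE_INDEX are names I'm assuming for this proposal, and the tokenization is simplified (a real implementation would use the model's chat template and special tokens).

```python
# Hypothetical sketch of input-masked packing. None of these names are
# real TRL API; -100 is the label value PyTorch's cross-entropy ignores.
IGNORE_INDEX = -100


def tokenize_messages(messages, tokenizer):
    """Tokenize a conversation, masking non-assistant turns in the labels."""
    input_ids, labels = [], []
    for message in messages:
        ids = tokenizer.encode(message["content"])
        input_ids.extend(ids)
        if message["role"] == "assistant":
            labels.extend(ids)  # train on the model's own outputs
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask user/system input
    return input_ids, labels


def pack(conversations, tokenizer, seq_length):
    """Pack tokenized conversations into fixed-length examples, carrying
    labels alongside token ids (mirroring ConstantLengthDataset's loop)."""
    all_token_ids, all_labels = [], []
    for messages in conversations:
        ids, labels = tokenize_messages(messages, tokenizer)
        all_token_ids.extend(ids + [tokenizer.eos_token_id])
        all_labels.extend(labels + [IGNORE_INDEX])  # don't train on EOS joins
    examples = []
    for i in range(0, len(all_token_ids) - seq_length + 1, seq_length):
        examples.append({
            "input_ids": all_token_ids[i : i + seq_length],
            "labels": all_labels[i : i + seq_length],
        })
    return examples
```

Because the labels travel with the token ids through the packing buffer, a conversation split across two packed examples keeps its mask aligned, which is exactly what the per-column tokenization in the current ConstantLengthDataset cannot do.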

Would this be possible to add into TRL?

Thanks

Peter-Devine avatar Mar 01 '24 09:03 Peter-Devine

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Mar 31 '24 15:03 github-actions[bot]

Bump. I'd like to see this implemented, as in my own personal experience training models such as RAG models with Axolotl, they definitely benefit from not training on inputs.

This makes intuitive sense: for most RAG models the input tokens vastly outnumber the output tokens, so the loss on the model's answer is vastly diluted by also training the model to language-model the inputs (which is pretty useless for a narrow-task model such as a RAG model).

If this feature could remain in consideration I'd be grateful!

Peter-Devine avatar Apr 01 '24 01:04 Peter-Devine

I would also like to see this implemented, thanks!

geoffreyangus avatar Apr 11 '24 20:04 geoffreyangus

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 06 '24 15:05 github-actions[bot]

bump

pzdkn avatar Aug 18 '24 08:08 pzdkn