
[FEATURE REQUEST] Allow masking input loss when packing=True

Open Peter-Devine opened this issue 11 months ago • 2 comments

Why

  • Chatbot training is probably the most common single use-case for TRL.
  • As such, it would be nice to give developers the option to mask the loss on the user's inputs, so that the model can fit its own outputs more quickly.
  • There are cases where masking the inputs makes little difference, but for tasks such as closed QA, which combine long user inputs with relatively short LLM outputs, I expect it could be beneficial.
  • This functionality is already available through DataCollatorForCompletionOnlyLM, but that collator does not support packing.
  • Therefore, it would be nice to be able to do both packing and masking at the same time.
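
To illustrate what masking the input loss means here, below is a minimal sketch, not TRL code. It assumes each conversation is already tokenized into (role, token_ids) pairs; the helper name mask_user_tokens and the toy token IDs are hypothetical. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so positions labelled with it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_user_tokens(turns):
    """Build (input_ids, labels) from pre-tokenized conversation turns.

    `turns` is a list of (role, token_ids) pairs. This is a hypothetical
    helper: tokens from any role other than "assistant" keep their place in
    input_ids but get IGNORE_INDEX as their label.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            # The model is trained to predict its own output tokens.
            labels.extend(token_ids)
        else:
            # User (or system) tokens are seen as context but not trained on.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy closed-QA style example: long user input, short assistant output.
conversation = [("user", [11, 12, 13, 14]), ("assistant", [21, 22])]
input_ids, labels = mask_user_tokens(conversation)
```

With long user turns and short answers, most label positions end up masked, which is the situation where I would expect masking to matter most.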

How

  • Use the functions I outline here to mask the labels of each conversation.
  • With a class like the ConstantLengthDataset, instead of tokenizing a piece of text from a defined column (https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L460), apply the tokenize_messages(messages, tokenizer) function to each row's conversation.
  • This yields input IDs and labels, which should then be handled identically. That is, initialize all_labels = [] alongside all_token_ids = [] (https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L463) and extend it in the same way as all_token_ids, but with the computed labels.
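
The packing step described above could look roughly like the sketch below. This is not the ConstantLengthDataset implementation: pack_with_labels is a hypothetical standalone function, it assumes each example is already an (input_ids, labels) pair with user tokens labelled -100, and it assumes the separator token between examples is kept as a training target (that last choice is mine and could equally be masked).

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def pack_with_labels(examples, seq_length, eos_token_id):
    """Pack (input_ids, labels) pairs into fixed-length chunks.

    Hypothetical sketch: both buffers are extended in lockstep so that
    every packed chunk has labels aligned with its input IDs.
    """
    all_token_ids, all_labels = [], []
    for input_ids, labels in examples:
        if len(input_ids) != len(labels):
            raise ValueError("input_ids and labels must have the same length")
        # Append the separator to both buffers. Assumption: the separator
        # is a training target, so the model learns where a reply ends.
        all_token_ids.extend(input_ids + [eos_token_id])
        all_labels.extend(labels + [eos_token_id])

    packed = []
    # Slice both buffers with the same offsets; an incomplete tail is dropped.
    for start in range(0, len(all_token_ids) - seq_length + 1, seq_length):
        end = start + seq_length
        packed.append(
            {"input_ids": all_token_ids[start:end], "labels": all_labels[start:end]}
        )
    return packed


# Two toy examples whose prompt tokens are already masked.
examples = [
    ([1, 2, 3], [IGNORE_INDEX, IGNORE_INDEX, 3]),
    ([4, 5], [IGNORE_INDEX, 5]),
]
packed = pack_with_labels(examples, seq_length=4, eos_token_id=0)
```

Because the two buffers are always sliced with the same offsets, a masked position in a conversation stays masked wherever it lands in a packed sequence.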

Would this be possible to add into TRL?

Thanks

Peter-Devine, Mar 01 '24 09:03