[FEATURE REQUEST] Allow masking input loss when packing=True
Why
- Chatbot training is probably the most common single use-case for TRL.
- As such, it would be nice to give developers the option to mask the loss on the user's input tokens so that a model can fit to its own outputs more quickly.
- Obviously there are cases where masking inputs does not make much difference, but for tasks such as closed QA, for example, which have long user inputs and relatively short LLM outputs, I can imagine this being beneficial.
- This functionality is already available via `DataCollatorForCompletionOnlyLM` (a rough usage sketch is given after this list), but that collator does not support packing.
- Therefore, it would be nice to be able to do both packing and masking at the same time.
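For context, this is roughly what completion-only masking looks like today with packing disabled. The "### Question:" / "### Answer:" format, the `gpt2` checkpoint, and the toy dataset below are placeholders, not part of the actual request:

```python
# Rough sketch of the existing completion-only masking (no packing).
# Prompt format, model checkpoint, and dataset are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

train_dataset = Dataset.from_dict(
    {"text": ["### Question: What is TRL?\n### Answer: A library for training LLMs with RL."]}
)

collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Question:",
    response_template="### Answer:",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,  # this collator cannot be combined with packing=True
)
trainer.train()
```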
How
- Use the functions I outline here to mask the labels of each conversation.
- In a class like `ConstantLengthDataset`, instead of tokenizing a piece of text from a defined column here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L460, apply the `tokenize_messages(messages, tokenizer)` function to each row's conversation.
- This gives input ids and labels, which should then both be treated the same way: initialize an `all_labels = []` alongside the `all_token_ids = []` here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L463 and append to it in the same way as `all_token_ids`, but from the computed labels. A rough sketch is given below.
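To make this concrete, here is a minimal sketch under a few assumptions: `tokenize_messages` is the helper referred to above (the implementation here is only illustrative, masking every non-assistant turn with -100), and `pack_examples` is a hypothetical standalone function that mirrors the `all_token_ids` buffering in `ConstantLengthDataset.__iter__` rather than patching the class itself:

```python
# Illustrative sketch only: `tokenize_messages` and `pack_examples` are
# hypothetical helpers, not existing TRL APIs. The packing loop keeps an
# all_labels list alongside all_token_ids and chunks both identically, so
# the input masking survives packing.
import torch


def tokenize_messages(messages, tokenizer, ignore_index=-100):
    """Tokenize a conversation turn by turn, masking non-assistant turns.

    Applying the chat template per turn is an approximation; a real
    implementation would need to handle special tokens more carefully.
    """
    input_ids, labels = [], []
    for message in messages:
        turn_ids = tokenizer.apply_chat_template([message], tokenize=True)
        input_ids.extend(turn_ids)
        if message["role"] == "assistant":
            labels.extend(turn_ids)                         # learn on assistant tokens
        else:
            labels.extend([ignore_index] * len(turn_ids))   # mask user/system tokens
    return input_ids, labels


def pack_examples(rows, tokenizer, seq_length=1024, ignore_index=-100):
    """Pack conversations into constant-length chunks with aligned labels."""
    all_token_ids, all_labels = [], []
    for row in rows:
        token_ids, labels = tokenize_messages(row["messages"], tokenizer)
        all_token_ids.extend(token_ids + [tokenizer.eos_token_id])
        all_labels.extend(labels + [ignore_index])          # no loss on the separator

    examples = []
    for i in range(0, len(all_token_ids) - seq_length + 1, seq_length):
        examples.append(
            {
                "input_ids": torch.LongTensor(all_token_ids[i : i + seq_length]),
                "labels": torch.LongTensor(all_labels[i : i + seq_length]),
            }
        )
    return examples
```

Inside `ConstantLengthDataset` itself, the same idea would just mean yielding `{"input_ids": ..., "labels": ...}` from the chunked `all_labels` instead of reusing the input ids as labels.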
Would this be possible to add into TRL?
Thanks