[FEATURE REQUEST] Allow masking input loss when packing=True
Why
- Chatbot training is probably the most common single use-case for TRL.
- As such, it would be nice to give developers the option to mask the loss on the user's input tokens so that a model can fit to its own outputs more quickly.
- Obviously there are cases where masking inputs does not make much difference, but for tasks such as closed QA, for example, which have long user inputs and relatively short LLM outputs, I can imagine this being beneficial.
- This functionality is already available via `DataCollatorForCompletionOnlyLM` (a rough usage sketch is given after this list), but that collator does not support packing.
- Therefore, it would be nice to be able to do both packing and masking at the same time.
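For context, this is roughly what completion-only masking looks like today with packing disabled. The "### Question:" / "### Answer:" format, the `gpt2` checkpoint, and the toy dataset below are placeholders, not part of the actual request:

```python
# Rough sketch of the existing completion-only masking (no packing).
# Prompt format, model checkpoint, and dataset are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

train_dataset = Dataset.from_dict(
    {"text": ["### Question: What is TRL?\n### Answer: A library for training LLMs with RL."]}
)

collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Question:",
    response_template="### Answer:",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,  # this collator cannot be combined with packing=True
)
trainer.train()
```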
How
- Use the functions I outline here to mask the labels of each conversation.
- In a class like `ConstantLengthDataset`, instead of tokenizing a piece of text from a defined column here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L460, apply the `tokenize_messages(messages, tokenizer)` function to each row's conversation.
- This gives input ids and labels, which should then both be treated the same way: initialize an `all_labels = []` alongside the `all_token_ids = []` here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L463 and append to it in the same way as `all_token_ids`, but from the computed labels. A rough sketch is given below.
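To make this concrete, here is a minimal sketch under a few assumptions: `tokenize_messages` is the helper referred to above (the implementation here is only illustrative, masking every non-assistant turn with -100), and `pack_examples` is a hypothetical standalone function that mirrors the `all_token_ids` buffering in `ConstantLengthDataset.__iter__` rather than patching the class itself:

```python
# Illustrative sketch only: `tokenize_messages` and `pack_examples` are
# hypothetical helpers, not existing TRL APIs. The packing loop keeps an
# all_labels list alongside all_token_ids and chunks both identically, so
# the input masking survives packing.
import torch


def tokenize_messages(messages, tokenizer, ignore_index=-100):
    """Tokenize a conversation turn by turn, masking non-assistant turns.

    Applying the chat template per turn is an approximation; a real
    implementation would need to handle special tokens more carefully.
    """
    input_ids, labels = [], []
    for message in messages:
        turn_ids = tokenizer.apply_chat_template([message], tokenize=True)
        input_ids.extend(turn_ids)
        if message["role"] == "assistant":
            labels.extend(turn_ids)                         # learn on assistant tokens
        else:
            labels.extend([ignore_index] * len(turn_ids))   # mask user/system tokens
    return input_ids, labels


def pack_examples(rows, tokenizer, seq_length=1024, ignore_index=-100):
    """Pack conversations into constant-length chunks with aligned labels."""
    all_token_ids, all_labels = [], []
    for row in rows:
        token_ids, labels = tokenize_messages(row["messages"], tokenizer)
        all_token_ids.extend(token_ids + [tokenizer.eos_token_id])
        all_labels.extend(labels + [ignore_index])          # no loss on the separator

    examples = []
    for i in range(0, len(all_token_ids) - seq_length + 1, seq_length):
        examples.append(
            {
                "input_ids": torch.LongTensor(all_token_ids[i : i + seq_length]),
                "labels": torch.LongTensor(all_labels[i : i + seq_length]),
            }
        )
    return examples
```

Inside `ConstantLengthDataset` itself, the same idea would just mean yielding `{"input_ids": ..., "labels": ...}` from the chunked `all_labels` instead of reusing the input ids as labels.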
Would this be possible to add into TRL?
Thanks