[FEATURE REQUEST] Allow masking input loss when packing=True
Why
- Chatbot training is probably the most common single use-case for TRL.
- As such, it would be nice to give developers the option to mask the loss on the user's input tokens so that a model can fit its own outputs more quickly.
- Obviously there are cases where masking inputs makes little difference, but for tasks such as closed QA, which have long user inputs and relatively short LLM outputs, I can imagine this being beneficial.
- This functionality is already available via the DataCollatorForCompletionOnlyLM, but that collator does not support packing.
- Therefore, it would be nice to be able to do both packing and masking at the same time.
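For reference, the kind of completion-only masking that DataCollatorForCompletionOnlyLM performs can be illustrated with a minimal, library-free sketch. The token ids and the two-token "response template" below are made up for illustration; this is not TRL's actual implementation.

```python
IGNORE_INDEX = -100  # convention: tokens with this label are excluded from the loss


def mask_prompt_labels(input_ids, response_template_ids):
    """Return labels where everything up to and including the response
    template is masked out, so only the assistant completion is trained on."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            # mask the prompt and the template itself
            for i in range(start + n):
                labels[i] = IGNORE_INDEX
            break
    return labels


# toy example: ids 7, 8 stand in for a "### Answer:" style response template
input_ids = [5, 6, 7, 8, 9, 10]
labels = mask_prompt_labels(input_ids, [7, 8])
# labels -> [-100, -100, -100, -100, 9, 10]
```

Because the collator searches for the response template inside each individual example, it cannot straightforwardly be applied to packed sequences, which is what motivates this request.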
How
- Using the functions I outline here to mask the labels of each conversation.
- With a class like the `ConstantLengthDataset`, instead of tokenizing a piece of text from a defined column here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L460, apply the `tokenize_messages(messages, tokenizer)` function to each row's conversation.
- This will give input ids and labels, which should then both be treated the same. I.e. initialize an `all_labels = []` alongside the `all_token_ids = []` in here: https://github.com/huggingface/trl/blob/14e0d788078be6406e580a2e8aa94cd451e5f909/trl/trainer/utils.py#L463, and then append values to it in the same way as `all_token_ids`, but from the computed labels.
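The steps above can be sketched in plain Python. Note that `tokenize_messages` and `pack_examples` are hypothetical names for the proposal, not functions that exist in TRL, and the toy per-character tokenizer stands in for a real one:

```python
IGNORE_INDEX = -100  # labels with this value are ignored by the loss


def tokenize_messages(messages, tokenize):
    """Hypothetical helper: tokenize a conversation and mask user turns.

    `messages` is a list of {"role": ..., "content": ...} dicts and
    `tokenize` is any callable mapping text to a list of token ids.
    """
    input_ids, labels = [], []
    for message in messages:
        ids = tokenize(message["content"])
        input_ids.extend(ids)
        if message["role"] == "assistant":
            labels.extend(ids)  # train on assistant tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask user tokens
    return input_ids, labels


def pack_examples(conversations, tokenize, seq_length):
    """Sketch of ConstantLengthDataset-style packing that carries labels
    alongside the token ids, as proposed above."""
    all_token_ids, all_labels = [], []
    for messages in conversations:
        ids, labels = tokenize_messages(messages, tokenize)
        all_token_ids.extend(ids)
        all_labels.extend(labels)
    examples = []
    # slice both buffers identically into constant-length chunks
    for i in range(0, len(all_token_ids) - seq_length + 1, seq_length):
        examples.append({
            "input_ids": all_token_ids[i:i + seq_length],
            "labels": all_labels[i:i + seq_length],
        })
    return examples


# toy usage: one-character-per-token "tokenizer"
tokenize = lambda text: [ord(c) for c in text]
conversations = [[
    {"role": "user", "content": "ab"},
    {"role": "assistant", "content": "cd"},
]]
packed = pack_examples(conversations, tokenize, seq_length=2)
# packed[0] -> {"input_ids": [97, 98], "labels": [-100, -100]}
# packed[1] -> {"input_ids": [99, 100], "labels": [99, 100]}
```

The key design point is simply that the labels buffer is sliced with exactly the same indices as the token-id buffer, so the masking survives packing.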
Would this be possible to add into TRL?
Thanks
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Bump. I'd like to see this implemented, as in my own experience training models such as RAG models using Axolotl, they definitely benefit from not training on inputs.
This makes intuitive sense: for most RAG models the input tokens vastly outnumber the output tokens, meaning that the loss on the model's answer is vastly diluted by training the model to language-model the inputs (which is fairly useless for a narrow-task model such as a RAG model).
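To make the dilution concrete, here is a toy calculation; the token counts are illustrative only, not measurements from any particular dataset:

```python
# Illustrative numbers: a RAG-style example with a long retrieved
# context plus question, and a short answer.
input_tokens = 2000   # retrieved context + user question
output_tokens = 100   # model answer

# Without input masking, every token contributes equally to the loss,
# so the answer accounts for only a small fraction of the training signal.
answer_share = output_tokens / (input_tokens + output_tokens)
print(f"{answer_share:.1%}")  # roughly 4.8% of the loss comes from the answer
```

With input masking, all of the gradient signal comes from the answer tokens instead.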
If this feature could remain in consideration I'd be grateful!
I would also like to see this implemented, thanks!
bump