Support masking of partial dialogue in multi-turn chat datasets
According to the documentation, Torchtune currently appears to support training on either all “assistant” and “user” content or all “assistant” content in a multi-turn conversation. However, a common use case is training on only a specific subset of responses, such as only the most recent “assistant” responses in a conversation.
What is the recommended approach for achieving this with Torchtune?
Hey @jiatong-yu, you should be able to write your own custom message_transform / dataset.
Here is our wiki: https://pytorch.org/torchtune/main/basics/message_transforms.html
Take a look at how it's done in the chat dataset: https://github.com/pytorch/torchtune/blob/aa8f365f91a69aa36aaea14cf6f03ccd45310bb6/torchtune/datasets/_chat.py#L21
Then, in your config, you can pass:
tune run <recipe> --config <config> dataset._component_=path.to.my.custom.dataset
Here's some additional info: https://github.com/pytorch/torchtune/issues/2111#issuecomment-2519077960
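For illustration, here is a rough sketch of the masking logic such a custom transform could apply. It does not use torchtune itself: `Msg` is a hypothetical stand-in dataclass with `role`, `content`, and `masked` fields, and `mask_all_but_last_assistant` is a made-up helper name. It assumes that a message with `masked=True` is excluded from the loss, and it keeps only the final “assistant” message unmasked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    """Hypothetical stand-in for a chat message with a per-message loss mask."""
    role: str
    content: str
    masked: bool = False  # assumption: True means "excluded from the loss"


def mask_all_but_last_assistant(messages: List[Msg]) -> List[Msg]:
    """Return copies of the messages where only the last assistant turn is unmasked."""
    # Index of the final assistant message, or None if the conversation has none.
    last_idx = max(
        (i for i, m in enumerate(messages) if m.role == "assistant"),
        default=None,
    )
    # Every message except the one at last_idx is masked; inputs are not mutated.
    return [
        Msg(m.role, m.content, masked=(i != last_idx))
        for i, m in enumerate(messages)
    ]


if __name__ == "__main__":
    convo = [
        Msg("user", "Hi"),
        Msg("assistant", "Hello!"),
        Msg("user", "What is 2 + 2?"),
        Msg("assistant", "4"),
    ]
    for m in mask_all_but_last_assistant(convo):
        print(m.role, m.masked)
```

In a real custom message transform, the same rule would be applied when building the message list for each sample, before tokenization. To train on the last N responses instead, collect the indices of all assistant messages and keep the final N unmasked.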