Support masking of partial dialogue in multi-turn chat datasets

Open jiatong-yu opened this issue 1 year ago • 2 comments

According to the documentation, torchtune currently supports training on either all “assistant” and “user” content, or all “assistant” content, in a multi-turn conversation. However, a common use case is training on only a specific subset of responses, such as only the most recent “assistant” response in a conversation.

What is the recommended approach for achieving this with Torchtune?

jiatong-yu avatar Dec 25 '24 17:12 jiatong-yu

Hey @jiatong-yu, you should be able to write your own custom message_transform / dataset.

Here is our wiki: https://pytorch.org/torchtune/main/basics/message_transforms.html

Take a look at how it's done in the chat dataset: https://github.com/pytorch/torchtune/blob/aa8f365f91a69aa36aaea14cf6f03ccd45310bb6/torchtune/datasets/_chat.py#L21

Then, in your config, you can pass:

tune run <recipe> --config <config> dataset._component_=path.to.my.custom.dataset
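
For the "only the most recent assistant response" case from the question, the masking logic inside such a custom transform could look roughly like the sketch below. It assumes each message carries a boolean `masked` flag that excludes it from the loss. The `Msg` class and the `mask_all_but_last_assistant` function are hypothetical stand-ins written for illustration, not torchtune APIs, so the field names should be checked against the message class you actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    """Hypothetical stand-in for a chat message with a loss-mask flag."""
    role: str
    content: str
    masked: bool = False  # True means "do not train on this message"


def mask_all_but_last_assistant(messages: List[Msg]) -> List[Msg]:
    """Return copies of the messages with only the final assistant turn unmasked."""
    # Index of the last assistant message, or -1 if there is none.
    last_idx = max(
        (i for i, m in enumerate(messages) if m.role == "assistant"),
        default=-1,
    )
    # Every message except the one at last_idx is masked out of the loss.
    return [
        Msg(m.role, m.content, masked=(i != last_idx))
        for i, m in enumerate(messages)
    ]


conversation = [
    Msg("user", "Hi"),
    Msg("assistant", "Hello!"),
    Msg("user", "What is 2 + 2?"),
    Msg("assistant", "4"),
]
result = mask_all_but_last_assistant(conversation)
mask_flags = [m.masked for m in result]
```

In a real custom dataset, the same rule would be applied while building the message list for each sample, before tokenization turns the per-message flags into a token-level mask.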

felipemello1 avatar Dec 26 '24 03:12 felipemello1

Here's some additional info: https://github.com/pytorch/torchtune/issues/2111#issuecomment-2519077960

calvinpelletier avatar Dec 26 '24 19:12 calvinpelletier