
Add data loader for HF oasst1

Open dwyatte opened this issue 1 year ago • 3 comments

Currently, training on OpenAssistant datasets (both the internal exports and https://huggingface.co/datasets/OpenAssistant/oasst1/blob/main/2023-04-12_oasst_ready.trees.jsonl.gz) requires downloading the data manually and specifying its path in the config, so that the tree can be parsed by model_training.custom_datasets.oasst_dataset.load_oasst_export.

We should write/refactor the data loader to work directly with the dataset returned by HF datasets.load_dataset("OpenAssistant/oasst1")
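
A minimal sketch of what such a loader could look like, not the project's actual implementation. It assumes the rows returned by load_dataset are flat messages carrying `message_id`, `parent_id` and `message_tree_id` fields, and that the root of a tree has `parent_id` set to None. The helper names `build_trees` and `load_oasst1_trees` are hypothetical, and the nested dict shape is only illustrative; it is not claimed to match what load_oasst_export returns.

```python
from typing import Any, Dict, Iterable


def build_trees(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Rebuild message trees from flat rows (hypothetical helper).

    Returns a mapping from message_tree_id to the root message, where every
    message carries a "replies" list holding its direct children.
    """
    # First pass: copy each row and give it an empty list of replies.
    nodes: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        nodes[row["message_id"]] = {**row, "replies": []}

    # Second pass: attach each message to its parent, or record it as a root.
    roots: Dict[str, Dict[str, Any]] = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            roots[node["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is missing from `rows` are skipped here;
        # a real loader would need to decide how to handle them.
    return roots


def load_oasst1_trees(split: str = "train") -> Dict[str, Dict[str, Any]]:
    """Load one split from the HF hub and rebuild its trees (assumed layout)."""
    # Imported lazily so build_trees stays usable without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("OpenAssistant/oasst1", split=split)
    return build_trees(dataset)
```

With something like this in place, the training config would only need a split name rather than a local file path.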

dwyatte avatar Apr 21 '23 22:04 dwyatte

We could split the message trees into 3 subsets (sft, rl and rm) and make them friendly for the dataset viewer as well. Any suggestions?

theblackcat102 avatar Apr 22 '23 11:04 theblackcat102

Can I work on this one?

grgau avatar Apr 22 '23 18:04 grgau

> We could split the message trees into 3 subsets (sft, rl and rm) and make them friendly for the dataset viewer as well. Any suggestions?

We have a super-tiny set of trees and IMO we cannot afford to split them into three independent groups.

andreaskoepf avatar Apr 23 '23 11:04 andreaskoepf