Open-Assistant
Add data loader for HF oasst1
Currently, to train on OpenAssistant datasets (both internal ones and https://huggingface.co/datasets/OpenAssistant/oasst1/blob/main/2023-04-12_oasst_ready.trees.jsonl.gz), the data has to be downloaded manually and its path specified via config, so that the trees can be parsed by model_training.custom_datasets.oasst_dataset.load_oasst_export.
We should write/refactor the data loader so that it works directly with the dataset returned by HF datasets.load_dataset("OpenAssistant/oasst1").
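A rough sketch of what this could look like, not a finished loader. It assumes the HF dataset is a flat table with one row per message and the columns message_id, parent_id, message_tree_id, role and text, and it regroups those rows into nested trees. The helper names build_trees and load_oasst1_trees and the nested "replies" layout are made up for illustration; they are not existing Open-Assistant code and are not guaranteed to match what load_oasst_export returns.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def build_trees(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group flat message rows into nested trees keyed by message_tree_id.

    Hypothetical helper: the column names are assumed from the dataset card.
    """
    nodes: Dict[str, Dict[str, Any]] = {}
    children: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    for row in rows:
        node = {
            "message_id": row["message_id"],
            "parent_id": row["parent_id"],
            "message_tree_id": row["message_tree_id"],
            "role": row["role"],
            "text": row["text"],
            "replies": [],
        }
        nodes[node["message_id"]] = node
        if node["parent_id"] is not None:
            children[node["parent_id"]].append(node)

    trees: Dict[str, Dict[str, Any]] = {}
    for node in nodes.values():
        # Attach replies; messages whose parent row is missing are never
        # reachable from a root and are therefore dropped from the result.
        node["replies"] = children.get(node["message_id"], [])
        if node["parent_id"] is None:
            trees[node["message_tree_id"]] = node
    return trees


def load_oasst1_trees(split: str = "train") -> Dict[str, Dict[str, Any]]:
    """Hypothetical entry point: download the split and rebuild its trees."""
    # Imported lazily so build_trees works without the `datasets` package.
    from datasets import load_dataset

    dataset = load_dataset("OpenAssistant/oasst1", split=split)
    return build_trees(dataset)
```

With something like this, the training config would only need the dataset name and split instead of a local file path. Whether the existing tree-walking code can consume this nested layout unchanged would still need to be checked.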
We could split the message trees into three subsets (sft, rl and rm) and make the dataset friendly for the dataset viewer as well. Any suggestions?
Can I work on this one?
> We could split the message trees into three subsets (sft, rl and rm) and make the dataset friendly for the dataset viewer as well. Any suggestions?
We have a very small set of trees, and IMO we cannot afford to split them into three independent groups.