Remove `datasets` requirement and instead rely on download from `huggingface_hub`
We currently handle all dataset manipulation through our own dataset classes; the only thing we use `datasets` for is a convenient wrapper for downloading a dataset to disk. We should be able to rely on `huggingface_hub` to do the same thing.
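A minimal sketch of what that replacement could look like (only `snapshot_download` is a real `huggingface_hub` API here; the wrapper name and wiring are illustrative):

```python
def download_dataset_files(repo_id: str, local_dir: str) -> str:
    """Fetch a dataset repo from the Hugging Face Hub to disk,
    mirroring what datasets.load_dataset currently does for us.

    Hypothetical wrapper: only snapshot_download itself is a real
    huggingface_hub API; the rest is illustrative.
    """
    # Imported lazily so the sketch stays self-contained.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" targets dataset repos rather than model repos.
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
    )
```

Our own dataset classes would then read the downloaded files directly, with no `datasets` dependency.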
@joecummings is this still something we plan to do?
Yes, but low in priorities.
To be clear, as it currently stands, is it not possible to provide a custom dataset in the appropriate chat template/format in the config file? By custom dataset I mean one that is not listed on this page but is available as a dataset on the Hugging Face Hub, or locally, and follows the right template.
Just to be clear, can I use my custom dataset, stored locally in the appropriate Alpaca format, by providing it in the config file to train the model? Specifically, can I override the Lora_70B.yaml config file with my custom dataset as shown below (dataset = my_custom_dataset), or in any other way?
tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora \
dataset = my_custom_dataset
@amitb-nuscale if you are trying to use a custom dataset builder function that returns a Dataset class in Python code, then yes, you can provide an override like you did (but without spaces):
dataset=module.path.to.my_custom_dataset
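For anyone finding this later, such a builder could be sketched as follows (the names and exact signature torchtune expects are assumptions here; the point is a plain function that returns a map-style dataset object):

```python
class MyCustomDataset:
    """Minimal map-style dataset: just __len__ and __getitem__."""

    def __init__(self, samples):
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        return self._samples[idx]


def my_custom_dataset(*args, **kwargs):
    """Hypothetical builder, referenced in the override as
    dataset=module.path.to.my_custom_dataset.

    A real builder would load and tokenize/format each sample;
    this one just wraps two hard-coded alpaca-style records.
    """
    samples = [
        {"instruction": "Say hi.", "input": "", "output": "Hi!"},
        {"instruction": "Add 2+2.", "input": "", "output": "4"},
    ]
    return MyCustomDataset(samples)
```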
If you have a local file saved in the alpaca format that you want to use instead, then you need to override the source and data_files parameters:
dataset.source=csv dataset.data_files=my_data_file.csv
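Expressed in the config file rather than on the command line, that override would look roughly like this (a sketch; the `_component_` key follows torchtune's config convention, and whether a given torchtune version's builder accepts these keys may vary, as the follow-up below shows):

```yaml
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  source: csv
  data_files: my_data_file.csv
```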
@RdoubleA - Maybe I am missing something, but changing it to dataset.source=csv dataset.data_files=my_data_file.csv generates an error: alpaca_dataset() got an unexpected keyword argument 'data_files'.
When I removed torchtune.dataset.alpaca_dataset from the config.yaml file, it states that 'NoneType' object has no attribute 'split'.
Hey @amitb-nuscale, are you still running into issues? We have since updated the tutorial on custom datasets that may address your problems. If you are still facing difficulties, would you be able to post a separate issue on it since it is unrelated? Thanks!
@joecummings @kartikayk closing this issue because, based on offline discussions around iterable/streaming datasets, we will need to rely on the HF Datasets APIs and load_dataset for more configuration around sharding/shuffling for iterable datasets.
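For context, this is the kind of configuration referred to (a sketch using real `datasets` streaming APIs; the function name and parameters are illustrative):

```python
def build_streaming_split(repo_id: str, rank: int, world_size: int, seed: int = 0):
    """Build a shuffled, per-rank shard of an iterable dataset.

    Uses the HF datasets streaming APIs this issue decided to keep
    relying on; the wrapper itself is illustrative.
    """
    # Imported lazily so the sketch stays self-contained.
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    # streaming=True returns an IterableDataset: no full download to disk.
    ds = load_dataset(repo_id, split="train", streaming=True)

    # Approximate shuffling via a fixed-size shuffle buffer.
    ds = ds.shuffle(seed=seed, buffer_size=1000)

    # Give each distributed worker its own shard.
    return split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

This sharding/shuffling behavior is what would be hard to replicate with a bare `huggingface_hub` download, hence keeping the `datasets` dependency.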