
Remove `datasets` requirement and instead rely on download from `huggingface_hub`

Open joecummings opened this issue 2 years ago • 6 comments

We currently handle all dataset manipulation through our own dataset classes; the only thing we use `datasets` for is a convenient wrapper for downloading the dataset to disk. We should be able to rely on `huggingface_hub` alone to do the same thing.
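A minimal sketch of the idea, assuming the raw file has already been fetched with `huggingface_hub.hf_hub_download(repo_id=..., repo_type="dataset")`: once the file is on disk, parsing an Alpaca-style JSON file needs only the standard library, since torchtune's own dataset classes handle the rest of the transformation. The `load_alpaca_records` helper below is hypothetical, not torchtune code.

```python
import json
import os
import tempfile

def load_alpaca_records(path):
    """Read a list of {"instruction", "input", "output"} dicts from disk."""
    with open(path) as f:
        records = json.load(f)
    # Basic shape check so malformed files fail early.
    for r in records:
        assert {"instruction", "output"} <= r.keys()
    return records

# Demonstrate with a tiny throwaway file standing in for the downloaded one:
sample = [{"instruction": "Say hi", "input": "", "output": "Hi!"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp = f.name

records = load_alpaca_records(tmp)
print(len(records))  # 1
os.unlink(tmp)
```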

joecummings avatar Feb 07 '24 14:02 joecummings

@joecummings is this still something we plan to do?

kartikayk avatar Apr 21 '24 16:04 kartikayk

Yes, but it's low priority.

joecummings avatar Apr 22 '24 14:04 joecummings

To be clear: as it currently stands, is it not possible to provide a custom dataset in the appropriate chat template/format via the config file? By custom dataset I mean one that is not listed on this page, but is available as a dataset on the Hugging Face Hub, or locally, and follows the right template.

venkatasg avatar Apr 26 '24 21:04 venkatasg

Just to be clear, can I use my custom dataset, stored locally in the appropriate Alpaca format, by providing it in the config file to train the model? Specifically, can I override the Lora_70B.yaml config file with my custom dataset as shown below (dataset = my_custom_dataset ) or in any other way?

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora \
dataset = my_custom_dataset 

amitb-nuscale avatar May 02 '24 13:05 amitb-nuscale

@amitb-nuscale if you are trying to use a custom dataset builder function that returns a Dataset class in Python code, then yes, you can provide an override like you did (but without spaces):

dataset=module.path.to.my_custom_dataset

If you have a local file saved in the alpaca format that you want to use instead, then you need to override the source and data_files parameters:

dataset.source=csv dataset.data_files=my_data_file.csv
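For intuition, dotted CLI overrides like these are folded into the nested config before the dataset builder is called. The sketch below is illustrative only (torchtune actually uses OmegaConf for this); `apply_overrides` is a hypothetical helper showing the general mechanism.

```python
def apply_overrides(cfg, overrides):
    """Fold 'a.b.c=value' strings into a nested config dict."""
    for item in overrides:
        dotted, value = item.split("=", 1)
        keys = dotted.split(".")
        node = cfg
        for k in keys[:-1]:
            node = node.setdefault(k, {})
        node[keys[-1]] = value
    return cfg

cfg = {"dataset": {"source": "tatsu-lab/alpaca"}}
apply_overrides(cfg, ["dataset.source=csv",
                      "dataset.data_files=my_data_file.csv"])
print(cfg["dataset"]["source"])      # csv
print(cfg["dataset"]["data_files"])  # my_data_file.csv
```

This is also why the override must contain no spaces around `=`: each override is parsed as a single `key=value` token.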

RdoubleA avatar May 02 '24 16:05 RdoubleA

@RdoubleA - Maybe I am missing something, but changing it to dataset.source=csv dataset.data_files=my_data_file.csv generates an error: alpaca_dataset() got an unexpected keyword argument 'data_files'.

When I removed torchtune.dataset.alpaca_dataset from the config.yaml file, it states that 'NoneType' object has no attribute 'split'.

amitb-nuscale avatar May 06 '24 03:05 amitb-nuscale

Hey @amitb-nuscale, are you still running into issues? We have since updated the tutorial on custom datasets that may address your problems. If you are still facing difficulties, would you be able to post a separate issue on it since it is unrelated? Thanks!

@joecummings @kartikayk closing this issue because based on offline discussions around iterable / streaming datasets, we will need to rely on HF Datasets APIs and load_dataset for more configuration around sharding / shuffling for iterable datasets.

RdoubleA avatar May 29 '24 13:05 RdoubleA