kedro-plugins
Better HuggingfaceDataset Support
Description
Huggingface provides storage solutions similar to s3 to make it easy to upload files to either datasets or models.
Context
I would like to make use of this, as it would make it easy to upload files to Hugging Face for people who don't use cloud solutions or who are just experimenting.
Possible Implementation
Extend the current HFDataset class with a save method that uploads files to a dataset/git repo on Hugging Face. Token management should be customizable, so the token can be passed via credentials.yml if it is not set in the environment.
Furthermore, the current class is limited to loading the whole dataset and relies on some kwargs. I think it would be very nice if this also worked the way we read and write data to S3: with paths, where Kedro sends the relevant command to fetch that file.
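A rough sketch of what the save side could look like, not a finished implementation. `HFStorageSketch` and `resolve_token` are hypothetical names, and the plain class stands in for a real Kedro dataset (which would subclass `AbstractDataset`). It assumes the `HF_TOKEN` environment variable takes precedence and that credentials.yml supplies a `token` key as the fallback. The upload uses `HfApi.upload_file` from `huggingface_hub`.

```python
import os
from typing import Optional


def resolve_token(credentials: Optional[dict] = None) -> Optional[str]:
    """Return the token from the environment, else from credentials.

    Assumption: the credentials entry uses a "token" key.
    """
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    if credentials:
        return credentials.get("token")
    return None


class HFStorageSketch:
    """Hypothetical dataset that pushes a single local file to a Hub repo."""

    def __init__(
        self,
        repo_id: str,
        path_in_repo: str,
        repo_type: str = "dataset",
        credentials: Optional[dict] = None,
    ) -> None:
        self._repo_id = repo_id
        self._path_in_repo = path_in_repo
        self._repo_type = repo_type
        self._token = resolve_token(credentials)

    def save(self, local_path: str) -> None:
        # Imported lazily so the class can be constructed without the
        # optional dependency installed.
        from huggingface_hub import HfApi

        HfApi(token=self._token).upload_file(
            path_or_fileobj=local_path,
            path_in_repo=self._path_in_repo,
            repo_id=self._repo_id,
            repo_type=self._repo_type,
        )
```

Something like `HFStorageSketch("user/repo", "data/train.csv").save("train.csv")` would then upload the local file, provided the token has write access to that repo.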
Hey @lordsoffallen, thanks for the issue! Would you be interested in working on this? Would it make sense to enhance the current dataset or create a new one, perhaps in kedro_datasets_experimental?
I think the current one only loads a Hugging Face dataset; it does not support a save operation similar to the S3 functionality under the hood.
I think this could maybe be a new one, like HFStorage, to push files there.
If I end up writing one soon, I can push a PR. I don't have a specific timeline for now, so I thought I'd put it here first.
xref to some improvements @lordsoffallen suggested a while back https://github.com/kedro-org/kedro-plugins/pull/612
One question: when you say "Huggingface provides storage solutions similar to s3 to make it easy to upload files to either datasets or models", do you mean something like
filepath: hf://whatever
?
like the custom transport Polars introduced? https://pola.rs/posts/polars-hugging-face/
@astrojuanlu yes, since we can upload datasets in different folder structures, I think something in that style would be super nice. It is very useful to push data to HF and then use a cloud platform to access it.
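To make the path style concrete, here is a minimal sketch of how a `filepath: hf://...` value could be split into the pieces an upload or download call needs. `HFPath` and `parse_hf_path` are hypothetical names. The sketch assumes the `hf://[datasets/|spaces/]<namespace>/<name>[@revision]/<path>` layout and that every repo id has the form `namespace/name`, so it does not cover every path the Hub accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HFPath:
    repo_type: str           # "dataset", "space" or "model"
    repo_id: str             # "<namespace>/<name>"
    revision: Optional[str]  # branch, tag or commit, if given with "@"
    path_in_repo: str        # file or folder path inside the repo


def parse_hf_path(filepath: str) -> HFPath:
    prefix = "hf://"
    if not filepath.startswith(prefix):
        raise ValueError(f"Expected an hf:// path, got: {filepath!r}")

    parts = filepath[len(prefix):].split("/")

    # No type prefix means a model repo; "datasets/" and "spaces/" are explicit.
    repo_type = "model"
    if parts[0] in ("datasets", "spaces"):
        repo_type = parts[0][:-1]  # "datasets" -> "dataset"
        parts = parts[1:]

    # Assumption: the repo id always has the form "<namespace>/<name>".
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Missing repo id in: {filepath!r}")

    namespace = parts[0]
    name, _, revision = parts[1].partition("@")
    return HFPath(
        repo_type=repo_type,
        repo_id=f"{namespace}/{name}",
        revision=revision or None,
        path_in_repo="/".join(parts[2:]),
    )
```

With this, `parse_hf_path("hf://datasets/user/repo/data/train.csv")` gives a `repo_id` of `user/repo` and a `path_in_repo` of `data/train.csv`, which map directly onto the arguments of a Hub upload or download call.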