Better HuggingfaceDataset Support

Open lordsoffallen opened this issue 7 months ago • 4 comments

Description

Hugging Face provides S3-like storage that makes it easy to upload files to either dataset or model repositories.

Context

I would like to make use of this because it would make uploading files to Hugging Face easy for people who don't use cloud solutions or who are just experimenting.

Possible Implementation

Extend the current HFDataset class with a save method that uploads files to a dataset/git repo on Hugging Face. Allow the token to be passed via credentials.yml if it is not set in the environment.
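A minimal sketch of the token handling and upload described above, not an existing Kedro API. The helper names `resolve_hf_token` and `upload_to_hub` and the `token` key in the credentials dict are hypothetical. `HF_TOKEN` is the environment variable that `huggingface_hub` reads, and `HfApi.upload_file` is the call it provides for pushing a single file.

```python
import os
from typing import Optional


def resolve_hf_token(credentials: Optional[dict] = None) -> Optional[str]:
    """Use HF_TOKEN from the environment; fall back to credentials.yml content.

    `credentials` is assumed to be the dict Kedro passes to a dataset,
    with a hypothetical `token` key.
    """
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    if credentials:
        return credentials.get("token")
    return None


def upload_to_hub(
    local_path: str,
    repo_id: str,
    path_in_repo: str,
    credentials: Optional[dict] = None,
    repo_type: str = "dataset",
) -> None:
    """Upload one local file to a Hugging Face repo (what a save method might do)."""
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=resolve_hf_token(credentials))
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type=repo_type,
    )
```

A save method on the dataset could then delegate to `upload_to_hub` after writing the data to a temporary local file.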

Furthermore, the current class is limited to loading the whole dataset and relies on some kwargs. I think it would be very nice if it also worked the way we read and write data to S3: the user gives a path, and Kedro sends the relevant command to fetch that file.
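To illustrate the path-based idea, here is a rough sketch of how a dataset might split an `hf://` path into the pieces an upload or download call needs. It assumes the layout used by the `hf://` paths in `huggingface_hub`, where dataset repos are prefixed with `datasets/`, Spaces with `spaces/`, and model repos have no prefix. The `HFPath` type and `parse_hf_path` function are hypothetical, and revision suffixes such as `@main` are not handled.

```python
from typing import NamedTuple


class HFPath(NamedTuple):
    repo_type: str      # "dataset", "space" or "model"
    repo_id: str        # "<user-or-org>/<repo-name>"
    path_in_repo: str   # file path inside the repo, may be empty


# Path prefix -> repo_type value (assumed convention, see lead-in).
_PREFIX_TO_REPO_TYPE = {"datasets": "dataset", "spaces": "space"}


def parse_hf_path(filepath: str) -> HFPath:
    """Split an hf:// path into repo type, repo id and path inside the repo."""
    scheme = "hf://"
    if not filepath.startswith(scheme):
        raise ValueError(f"Expected a path starting with {scheme!r}: {filepath!r}")

    parts = filepath[len(scheme):].split("/")

    # No prefix means a model repo; otherwise strip the prefix segment.
    repo_type = "model"
    if parts[0] in _PREFIX_TO_REPO_TYPE:
        repo_type = _PREFIX_TO_REPO_TYPE[parts[0]]
        parts = parts[1:]

    # The next two segments must be the namespace and the repo name.
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"Missing '<user>/<repo>' in {filepath!r}")

    return HFPath(repo_type, "/".join(parts[:2]), "/".join(parts[2:]))
```

For example, `parse_hf_path("hf://datasets/my-user/my-data/raw/train.csv")` yields the repo type `dataset`, the repo id `my-user/my-data`, and the in-repo path `raw/train.csv`, which a load or save method could pass on to the Hub client.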

lordsoffallen avatar Apr 14 '25 08:04 lordsoffallen

Hey @lordsoffallen, thanks for the issue! Would you be interested in working on this? Would it make sense to enhance the current dataset or create a new one, perhaps in kedro_datasets_experimental?

ankatiyar avatar Apr 16 '25 13:04 ankatiyar

> Hey @lordsoffallen, thanks for the issue! Would you be interested in working on this? Would it make sense to enhance the current dataset or create a new one, perhaps in kedro_datasets_experimental?

I think the current one only loads a Hugging Face dataset; it does not support a save operation similar to the S3 functionality under the hood.

I think this could maybe be a new one, like HFStorage, to push files there.

If I end up writing one soon, I can push a PR. I don't have a specific timeline for now, so I thought I'd put it here first.

lordsoffallen avatar Apr 17 '25 13:04 lordsoffallen

xref to some improvements @lordsoffallen suggested a while back https://github.com/kedro-org/kedro-plugins/pull/612

One question: when you say "Huggingface provides storage solutions similar to s3 to make it easy to upload files to either datasets or models", do you mean something like

filepath: hf://whatever

?

like the custom transport Polars introduced? https://pola.rs/posts/polars-hugging-face/

astrojuanlu avatar May 26 '25 14:05 astrojuanlu

@astrojuanlu yes. Since we can upload datasets in different folder structures, I think something in that style would be super nice. It is very useful to push data to HF and then access it from a cloud platform.

lordsoffallen avatar May 26 '25 17:05 lordsoffallen