
Datasets from Huggingface

Open dxqb opened this issue 8 months ago • 2 comments

This was an early request during cloud development: (large) datasets should be downloadable by the cloud trainer directly from other cloud storage, to avoid having to upload your dataset from your local machine to the cloud for every training run: https://github.com/dxqbYD/OneTrainer/issues/5

Huggingface has a nice interface for uploading datasets.

It provides 300 GB of free storage, datasets can be private, and it has an API that OT already uses. So it is the ideal cloud storage for our datasets.
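
For reference, the one-time upload can also be done programmatically through the same API. This is only a minimal sketch using huggingface_hub, not OT's actual code; the repo name and folder path are placeholders:

```python
# Sketch only: push a local dataset folder once to a private Hugging Face
# dataset repo using huggingface_hub (repo id and paths are placeholders).
from huggingface_hub import create_repo, upload_folder

repo_id = "your-username/my-training-dataset"  # hypothetical dataset repo

# Create the private dataset repo if it doesn't exist yet.
create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)

# Upload the whole local dataset folder in one commit.
upload_folder(
    repo_id=repo_id,
    repo_type="dataset",
    folder_path="/path/to/local/dataset",
    commit_message="initial dataset upload",
)
```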

Instead of uploading a (large) dataset to the cloud for every training run, it is uploaded to Huggingface only once and then downloaded by the cloud trainer very quickly:

Fetching 2001 files: 100%|██████████| 2001/2001 [00:36<00:00, 55.42it/s]
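
On the cloud side this boils down to pulling a snapshot of the dataset repo. A rough sketch, assuming a hypothetical repo id and target path (not the exact OT implementation):

```python
# Sketch only: the cloud trainer fetches the whole dataset repo in parallel,
# which is what produces the "Fetching N files" progress bar above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-username/my-training-dataset",  # hypothetical private dataset repo
    repo_type="dataset",
    local_dir="/workspace/dataset",  # where the cloud trainer expects the data
    max_workers=8,                   # parallel downloads help with large datasets
)
print(f"dataset available at {local_path}")
```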

I've tried to include good support for local use too, such as previews and statistics, so this feature isn't broken when used locally. But of course there is less reason to use it if you only train locally.

dxqb · Apr 28 '25 16:04

@efhosci

dxqb · Apr 28 '25 16:04

I'll look through this later and maybe experiment with cloud data. Some of the basic statistics (image count, image/mask/caption pairing) may be possible without downloading everything, if you can just check whether a file exists at a cloud path. Information about resolution, bucketing, etc. is more important for smaller datasets. For preview images, you may be able to download just a few locally, so that image variations can be experimented with and the content can be identified from the thumbnails. A sketch of that idea is below.
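
A rough sketch of that idea with the Hub API, assuming captions are .txt files named after the image and masks use a -masklabel.png suffix (repo id, file layout, and paths are illustrative, not a definitive implementation):

```python
# Sketch only: compute basic statistics and pairing from the file listing
# alone, then download just a few images for preview thumbnails.
from pathlib import PurePosixPath
from huggingface_hub import HfApi, hf_hub_download

repo_id = "your-username/my-training-dataset"  # hypothetical dataset repo
api = HfApi()

# File listing only -- no data is downloaded here.
files = api.list_repo_files(repo_id, repo_type="dataset")

image_exts = (".jpg", ".jpeg", ".png", ".webp")
images = {PurePosixPath(f).stem for f in files
          if f.lower().endswith(image_exts) and not f.endswith("-masklabel.png")}
captions = {PurePosixPath(f).stem for f in files if f.endswith(".txt")}
masks = {PurePosixPath(f).stem.removesuffix("-masklabel") for f in files
         if f.endswith("-masklabel.png")}

print(f"{len(images)} images, {len(images & captions)} with captions, "
      f"{len(images & masks)} with masks")

# Download only a handful of images locally for preview thumbnails.
preview_files = [f for f in files
                 if f.lower().endswith(image_exts) and not f.endswith("-masklabel.png")][:4]
for f in preview_files:
    hf_hub_download(repo_id, f, repo_type="dataset", local_dir="/tmp/previews")
```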

efhosci · Apr 28 '25 20:04