Support datasets from huggingface

Open SingL3 opened this issue 2 years ago • 3 comments
🚀 Feature Request

Currently this package can load data from a local path, HTTP, or S3. Is there a plan to support Hugging Face datasets?

Motivation

Some datasets are distributed in non-JSONL formats, such as Parquet.

[Optional] Implementation

Additional context

SingL3 avatar Jul 06 '23 02:07 SingL3

The composer Trainer accepts an arbitrary train_dataloader, so I'm not sure what you mean here. Could you please clarify?

dakinggg avatar Jul 10 '23 22:07 dakinggg

As you can see here: https://github.com/mosaicml/composer/blob/f2a2dc820cb75023b9eb7c46fdfd25273712abd0/composer/datasets/in_context_learning_evaluation.py#L145 This means the data has to be a local file, and other data formats such as Parquet are not supported. The benefit of `datasets` would be that it can download the data automatically when there is no local file.

SingL3 avatar Jul 11 '23 03:07 SingL3

Ah, got it! The code actually does automatically download from object store (https://github.com/mosaicml/composer/blob/ff59e862b92a7a1e62f72b57e36f528eb2c4bdfa/composer/datasets/in_context_learning_evaluation.py#L333), and the ICL classes expect the data to be in a particular format that probably isn't super common on the HF hub, but we can look more into supporting that!
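
In the meantime, one possible workaround is to convert the rows of a dataset to JSONL yourself and point the ICL loader at the resulting file. The following is only a minimal sketch using the standard library: `write_jsonl` is a hypothetical helper, not part of composer, and the `context` / `continuation` field names are assumed for illustration, so check what the ICL class you use actually expects. The stand-in `rows` list could be replaced by any iterable of dict-like rows, for example a split loaded with the `datasets` library.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(records: Iterable[Mapping], path: Path) -> int:
    """Hypothetical helper: write one JSON object per line, return the row count."""
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for row in records:
            # dict(row) also accepts dict-like rows, not just plain dicts
            f.write(json.dumps(dict(row), ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in rows; the field names are assumptions, not a confirmed composer schema.
rows = [
    {"context": "2 + 2 =", "continuation": " 4"},
    {"context": "The capital of France is", "continuation": " Paris"},
]

out_path = Path(tempfile.mkdtemp()) / "eval.jsonl"
n_written = write_jsonl(rows, out_path)
```

The resulting local path can then be used wherever a JSONL file path is accepted today.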

dakinggg avatar Jul 23 '23 23:07 dakinggg