composer
Support datasets from huggingface
🚀 Feature Request
Currently this package can load data from a local path, HTTP, or S3. Is there a plan to support Hugging Face datasets?
Motivation
Some datasets are distributed in formats other than JSONL, such as Parquet.
[Optional] Implementation
Additional context
The composer Trainer accepts an arbitrary train_dataloader, so I'm not sure what you mean here. Could you please clarify?
As you can see here:
https://github.com/mosaicml/composer/blob/f2a2dc820cb75023b9eb7c46fdfd25273712abd0/composer/datasets/in_context_learning_evaluation.py#L145
This means the data has to be a local file, and other data formats such as Parquet are not supported.
One benefit of the datasets library is that it can download the data automatically when there is no local file.
Ah, got it! The code actually does automatically download from object store (https://github.com/mosaicml/composer/blob/ff59e862b92a7a1e62f72b57e36f528eb2c4bdfa/composer/datasets/in_context_learning_evaluation.py#L333). The ICL classes expect the data to be in a particular format that probably isn't very common on the HF Hub, but we can look more into supporting that!
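In the meantime, one possible workaround is to export a Hub dataset to a local JSONL file and point the existing loader at that file. This is only a rough sketch, not something composer provides: `rows_to_jsonl` and `export_hub_dataset_to_jsonl` are hypothetical helper names, it assumes the `datasets` package is installed, and it does not rename or reshape fields into whatever the ICL classes expect, so that mapping would still have to be done separately.

```python
import json
from typing import Any, Dict, Iterable


def rows_to_jsonl(rows: Iterable[Dict[str, Any]], path: str) -> int:
    """Write each row as one JSON object per line; return the number of rows written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Rows must contain only JSON-serializable values (str, int, list, ...).
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def export_hub_dataset_to_jsonl(dataset_name: str, split: str, path: str) -> int:
    """Download one split of a Hub dataset (e.g. stored as Parquet) and dump it as JSONL."""
    # Imported lazily so rows_to_jsonl works even without `datasets` installed.
    from datasets import load_dataset

    # load_dataset downloads and caches the data if it is not already local.
    dataset = load_dataset(dataset_name, split=split)
    # Iterating over a Dataset yields one dict per row.
    return rows_to_jsonl(dataset, path)
```

Usage would look something like `export_hub_dataset_to_jsonl("some-org/some-dataset", "test", "local_eval.jsonl")`, where the dataset name, split, and output path are placeholders. Rows that hold non-JSON values (images, audio, and so on) would need to be converted or dropped first.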