
Modify LoadHubDataset to work offline and default to non-streaming

Open plaguss opened this issue 10 months ago • 0 comments

Description

This PR changes the default behaviour of LoadHubDataset to streaming=False, and tries to fetch the column info from the cached dataset when one is found.

Closes #561, and also includes the functionality from PR #520:

from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset

load_hub_dataset = LoadHubDataset(
    name="load_dataset",
    repo_id="HuggingFaceH4/instruction-dataset",
    split="test",
    batch_size=8,
    num_examples=4,
    pipeline=Pipeline(name="dataset-pipeline"),
)
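
For reviewers, the "cached dataset first" lookup from the description can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `resolve_column_names`, `read_cached` and `fetch_remote` are hypothetical names, and the column names in the usage lines are made up. In practice the two callables would presumably wrap calls into the Hugging Face `datasets` library.

```python
from typing import Callable, List, Optional


def resolve_column_names(
    read_cached: Callable[[], Optional[List[str]]],
    fetch_remote: Callable[[], List[str]],
) -> List[str]:
    """Return column names, preferring a local cache over a network call.

    Both callables are hypothetical stand-ins: `read_cached` returns None
    when no cached copy of the dataset exists, and `fetch_remote` asks the
    Hub for the column info.
    """
    cached = read_cached()
    if cached is not None:
        # Cache hit: no network access is needed, so this path works offline.
        return cached
    # Cache miss: fall back to the remote lookup.
    return fetch_remote()


# Usage with made-up column names: the remote lookup is only used on a miss.
columns_offline = resolve_column_names(
    read_cached=lambda: ["instruction", "response"],
    fetch_remote=lambda: ["never", "used"],
)
columns_online = resolve_column_names(
    read_cached=lambda: None,
    fetch_remote=lambda: ["instruction", "response"],
)
```

Checking the cache before touching the network is what lets the step report its output columns without a connection, as long as the dataset was downloaded at least once.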

plaguss, Apr 24 '24 08:04