distilabel
Modify LoadHubDataset to work offline and disable streaming by default
Description
This PR changes the default behaviour of LoadHubDataset to use streaming=False, and tries to fetch the column info from the cached dataset if one is found.
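To illustrate the "column info from the cached dataset" idea, here is a minimal sketch, not the code in this PR. It assumes the local cache contains a dataset_info.json file with a "features" mapping (which is how the datasets library normally records schema metadata). The helper names cached_column_names and resolve_columns, and the fetch_remote callback, are hypothetical and exist only for this example.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional


def cached_column_names(cache_dir: Path) -> Optional[List[str]]:
    """Return column names from the first usable dataset_info.json, or None.

    Hypothetical helper: assumes the cache stores a JSON object with a
    "features" mapping whose keys are the column names.
    """
    if not cache_dir.is_dir():
        return None
    for info_path in sorted(cache_dir.rglob("dataset_info.json")):
        try:
            info = json.loads(info_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt cache entry: try the next one
        if not isinstance(info, dict):
            continue
        features = info.get("features")
        if isinstance(features, dict) and features:
            return list(features)
    return None


def resolve_columns(cache_dir: Path, fetch_remote: Callable[[], List[str]]) -> List[str]:
    """Prefer cached column info; only call the remote lookup when the cache misses."""
    cached = cached_column_names(cache_dir)
    if cached is not None:
        return cached
    return fetch_remote()
```

With this shape, an offline run never reaches fetch_remote as long as the dataset has been downloaded once, which is the behaviour the description above is aiming for.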
Closes #561, and also includes the functionality from PR #520:
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset

load_hub_dataset = LoadHubDataset(
    name="load_dataset",
    repo_id="HuggingFaceH4/instruction-dataset",
    split="test",
    batch_size=8,
    num_examples=4,
    pipeline=Pipeline(name="dataset-pipeline"),
)