distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
## Description This PR modifies the default behaviour of `LoadHubDataset` to use `streaming=False` as default, and tries to fetch the column info from the cached dataset if found. Closes #561,...
This draft PR proposes a way of using a DSPy prediction module as a text generation step. The advantage of this is that text generation could use an optimised, evaluated,...
## Description WIP
## Description This PR aligns the kwargs for some of the implemented `LLM` subclasses, based on their engine counterparts, so that all the kwargs can be provided to the `LLM`...
**Is your feature request related to a problem? Please describe.** I appreciate the work distilabel is doing and making it easier for the community to produce high-quality datasets. Thank you!...
**Describe the bug** Apparently, the cache location in the `Pipeline.run` method differs before and after calling `super().run`, since the signature is updated, which modifies the path, so...
## Description Make `_Step.model_post_init` less strict, so that steps can be instantiated without a `Pipeline` and a warning is emitted instead of raising a `ValueError`. This should simplify testing steps without the...
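The idea of warning instead of raising at construction time can be illustrated with a minimal, standard-library sketch. `SketchStep` and its fields are hypothetical stand-ins chosen for illustration; this is not distilabel's `_Step` implementation, which the PR description does not show.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchStep:
    """Hypothetical stand-in for a step; not distilabel's `_Step`."""

    name: str
    # Assumption: a step may optionally hold a reference to its pipeline.
    pipeline: Optional[object] = None

    def __post_init__(self) -> None:
        # Lenient check: emit a warning rather than raising ValueError
        # when the step is created without a pipeline.
        if self.pipeline is None:
            warnings.warn(
                f"Step '{self.name}' was created without a pipeline.",
                UserWarning,
                stacklevel=2,
            )


# Capture warnings so the behaviour can be inspected, e.g. in a unit test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    step = SketchStep(name="standalone")
```

With this pattern, a step can be built in isolation for testing, while the warning still flags that it is not attached to a pipeline.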
## Description See milestone https://github.com/argilla-io/distilabel/milestone/8
**Is your feature request related to a problem? Please describe.** In the generated dataset, we say to rate following the annotation guidelines, but the guidelines are empty. **Describe the solution you'd like**...
I want to push my results to Hugging Face with a frequency of 2000, as in distilabel 0.6.0: ``` freq = 2000 dataset_checkpoint = DatasetCheckpoint(path=Path.cwd() / "checkpoint_folder_evol_cn", save_frequency=freq, strategy = 'hf-hub', extra_kwargs={"repo_id": 'xxx/xxx',...