distilabel
Distilabel is a framework for synthetic data generation and AI feedback, built for engineers who need fast, reliable, and scalable pipelines based on verified research papers.
**Is your feature request related to a problem? Please describe.** Both tasks seem to share a lot of logic, so there is some code duplication. **Describe the solution you'd like**...
**Is your feature request related to a problem? Please describe.** We might suffer from downloading unneeded large models. **Describe the solution you'd like** Something like this https://huggingface.co/distilabel-internal-testing/tiny-random-mistral was proposed by...
## Description Add a custom `Step` that runs `DSPy`, even if it's only an example of how to use it via `distilabel` v1.0.0. The step could optimize a prompt from...
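As a rough illustration of the shape such a step could take, here is a minimal sketch in plain Python. The class name `DSPyOptimizeStep`, the `inputs`/`outputs` attributes, and the `process` generator mirror distilabel's `Step` pattern, but the actual `distilabel` and `dspy` imports are omitted; everything here is a hypothetical stand-in, not the proposed implementation.

```python
class DSPyOptimizeStep:
    """Hypothetical step: consumes 'instruction' rows, emits 'optimized_prompt'.

    In a real distilabel Step, this class would subclass distilabel's Step and
    the body of process() would call into DSPy to compile/optimize the prompt.
    """

    inputs = ["instruction"]
    outputs = ["optimized_prompt"]

    def process(self, rows):
        for row in rows:
            # Placeholder for the DSPy optimization call (assumption).
            row["optimized_prompt"] = f"Improved: {row['instruction']}"
        # distilabel steps yield batches of rows rather than returning them.
        yield rows


step = DSPyOptimizeStep()
batch = next(step.process([{"instruction": "write a haiku"}]))
print(batch[0]["optimized_prompt"])
```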
**Is your feature request related to a problem? Please describe.** Async is cool but debugging can be a pain. **Describe the solution you'd like** I would love to have synchronous...
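One common way to expose a synchronous surface over async internals, which may be what this request has in mind, is a thin facade that drives the coroutine to completion. The `agenerate`/`generate` names below are hypothetical, not distilabel's API; this only sketches the wrapping pattern.

```python
import asyncio


async def agenerate(prompt: str) -> str:
    # Stand-in for an async LLM call (hypothetical).
    await asyncio.sleep(0)
    return f"response to: {prompt}"


def generate(prompt: str) -> str:
    # Synchronous facade: run the coroutine to completion so that
    # breakpoints and stack traces behave like ordinary blocking code.
    return asyncio.run(agenerate(prompt))


print(generate("hello"))  # prints "response to: hello"
```

The trade-off is that `asyncio.run` cannot be called from inside an already-running event loop, so such a facade suits scripts and debugging sessions rather than nested async contexts.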
Create a notebook showing an end2end workflow with distilabel to create a preference dataset based on a ~200-page economic document (IMF World Economic Outlook, April 2023). The preference dataset could...
## Which page or section is this issue related to? Currently the code snippet in the vLLM section of the guide (https://distilabel.argilla.io/latest/technical-reference/llms/#vllm) looks like:

```python
llm = vLLM(
    model=LLM(model="argilla/notus-7b-v1"),
    task=TextGenerationTask(),
    ...
```
## Description A high impact task for distilabel is one that generates follow-up turns or multi-turn dialogues (which can then be criticized/ranked). Given a conversation (or at least a...
**Is your feature request related to a problem? Please describe.** In [this PR](https://github.com/argilla-io/distilabel/pull/203), we introduced the `ChatTask` but we want to add as much information to the data we send...
The idea is to set up the Open In Colab and Open GitHub Source as a template overridden feature of the mkdocs template, that should be possible. We have some...
Our current preference pipelines work under the assumption of single-turn (instruction) datasets. To generate high-quality preference data we need to support multi-turn data.
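To make the gap concrete, the difference between the two dataset shapes can be sketched as below. The field names (`instruction`, `messages`, `generations`) and the `last_user_turn` helper are illustrative assumptions about how such rows might look, following the common role/content chat format, not a confirmed distilabel schema.

```python
# Single-turn row: the current assumption, one instruction per example.
single_turn = {
    "instruction": "Explain inflation.",
    "generations": ["candidate A", "candidate B"],  # responses to rank
}

# Multi-turn row: the conversation so far as role/content messages,
# with candidate continuations to criticize/rank.
multi_turn = {
    "messages": [
        {"role": "user", "content": "Explain inflation."},
        {"role": "assistant", "content": "Inflation is a general rise in prices."},
        {"role": "user", "content": "How does it affect savings?"},
    ],
    "generations": ["candidate A", "candidate B"],
}


def last_user_turn(row):
    # Hypothetical helper: the latest user message is what the next
    # generation (and its ranking) should be conditioned on.
    return [m["content"] for m in row["messages"] if m["role"] == "user"][-1]


print(last_user_turn(multi_turn))  # prints "How does it affect savings?"
```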