
Distilabel is a framework for synthetic data generation and AI feedback, built for engineers who need fast, reliable, and scalable pipelines grounded in verified research papers.

Results 168 distilabel issues

**Describe the bug** Generating larger datasets with `LoadDataFromDicts` leads to underutilization of the GPU during the `TextGeneration` step. **To Reproduce** Setting `N_SAMPLES` to a small value in the code below...

## Description In order to run the `vLLM` tests within the CI, we should be installing `vllm` for CPU as per the official docs at https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html cc @gabrielmbmb

dependencies
ci

**Is your feature request related to a problem? Please describe.** Along with the addition of `raw_output_` and the proposed `raw_input_` (https://github.com/argilla-io/distilabel/issues/698), I think it would be nice to align this...

enhancement

## Description Currently the `logging` handler created for the LLMs is named `distilabel.llm.{llm.model_name}`, but the `llm.model_name` property shouldn't be used for this, since it can be confusing or even expose a...
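The issue above argues against putting `llm.model_name` in the logger name. A minimal sketch of the alternative, assuming the class name is the stable substitute (the `OpenAILLM` stand-in and the naming scheme here are illustrative, not the project's decided fix):

```python
import logging


class OpenAILLM:
    """Stand-in LLM class for illustration."""
    # A model name can be confusing or private, so it should not leak into logs.
    model_name = "org-private/gpt-4-finetune"


def get_llm_logger(llm) -> logging.Logger:
    # Name the logger after the class rather than the model, so log output
    # stays stable and never exposes a sensitive model identifier.
    return logging.getLogger(f"distilabel.llm.{type(llm).__name__.lower()}")


logger = get_llm_logger(OpenAILLM())
print(logger.name)  # distilabel.llm.openaillm
```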

improvement

This PR adds a general step that enables the use of the OpenAI Batch API as discussed in #538. The Step follows roughly the same API as a Task but...
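The PR excerpt doesn't show the step's implementation, but the OpenAI Batch API it wraps consumes a JSONL file in which each line is one request. A stdlib-only sketch of building that payload (the `custom_id` scheme and model name are illustrative):

```python
import json


def build_batch_lines(prompts, model="gpt-4o-mini"):
    """Serialize prompts into the JSONL request format the Batch API expects."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{i}",  # used to match responses back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


payload = build_batch_lines(["What is distilabel?"])
print(payload)
```

The resulting string would be uploaded as a file and referenced when creating the batch job; responses carry the same `custom_id`, which is what lets a batching step reorder results back into the pipeline.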

## Description TODO ### Ideas * Add `use_cache` flag at `Step` level to avoid caching

improvement

- Went for lancedb because it works in memory. - @frascuchon, as a follow-up we can consider adding Argilla based on your vector search PR :) Do vector search using...
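The comment above motivates lancedb with its in-memory operation. As a generic illustration of what an in-memory vector search does (not lancedb's actual API), a brute-force cosine-similarity lookup:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=1):
    """Return the k entries whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(item["vector"], query), reverse=True)
    return ranked[:k]


# Toy index; real entries would hold embeddings from a model.
index = [
    {"text": "distilabel is awesome!", "vector": [0.9, 0.1, 0.0]},
    {"text": "and Argilla!", "vector": [0.1, 0.9, 0.0]},
]
top = search(index, [1.0, 0.0, 0.0])[0]
print(top["text"])  # distilabel is awesome!
```

A dedicated engine like lancedb replaces the linear scan with an index structure, but the query semantics are the same.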

This PR adds llama-cpp support to create embeddings.
```
from distilabel.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(model="second-state/all-MiniLM-L6-v2-Q2_K.gguf")
embeddings.load()
results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
#   [-0.05447685346007347, -0.01623094454407692, ...],
...
```

Fix imputing the output when the `output_mapping` is not empty

**Is your feature request related to a problem? Please describe.** When running a TextGeneration task on a big dataset using the OpenAI API, I'm getting the following error: `openai.RateLimitError: Error...
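A common mitigation for `openai.RateLimitError` on large datasets is exponential backoff with retries. A library-agnostic sketch (the exception type, retry count, and delays here are illustrative, and a flaky stand-in replaces the real API call):

```python
import time


def with_backoff(call, retryable=(Exception,), max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))


# Stand-in for an API call that fails twice before succeeding.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = with_backoff(flaky, retryable=(RuntimeError,), base_delay=0.01)
print(result)  # ok
```

In a real pipeline the `retryable` tuple would be `(openai.RateLimitError,)`, and a token- or request-rate limiter in front of the calls avoids hitting the limit in the first place.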