
Distilabel is a framework for synthetic data generation and AI feedback, built for engineers who need fast, reliable, and scalable pipelines grounded in verified research papers.

Results 168 distilabel issues

**Describe the bug** Generating larger datasets with `LoadDataFromDicts` leads to underutilization of the GPU during the `TextGeneration` step. **To Reproduce** Setting `N_SAMPLES` to a small value in the code below...

## Description In order to run the `vLLM` tests within the CI, we should be installing `vllm` for CPU as per the official docs at https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html cc @gabrielmbmb

dependencies
ci

**Is your feature request related to a problem? Please describe.** Along with the addition of `raw_output_` and the proposed `raw_input_` (https://github.com/argilla-io/distilabel/issues/698), I think it would be nice to align this...

enhancement

## Description Currently the `logging` handler created for the LLMs is named `distilabel.llm.{llm.model_name}`, but the `llm.model_name` property shouldn't be used for this, since it can be confusing or even expose a...
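The issue above argues against putting `llm.model_name` in the logger name. A minimal sketch of the alternative, assuming the class name is the stable substitute (the `OpenAILLM` stand-in and the naming scheme here are illustrative, not the project's decided fix):

```python
import logging


class OpenAILLM:
    """Stand-in LLM class for illustration."""
    # A model name can be confusing or private, so it should not leak into logs.
    model_name = "org-private/gpt-4-finetune"


def get_llm_logger(llm) -> logging.Logger:
    # Name the logger after the class rather than the model, so log output
    # stays stable and never exposes a sensitive model identifier.
    return logging.getLogger(f"distilabel.llm.{type(llm).__name__.lower()}")


logger = get_llm_logger(OpenAILLM())
print(logger.name)  # distilabel.llm.openaillm
```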

improvement

This PR adds a general step that enables the use of the OpenAI Batch API as discussed in #538. The Step follows roughly the same API as a Task but...
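The PR excerpt doesn't show the step's implementation, but the OpenAI Batch API it wraps consumes a JSONL file in which each line is one request. A stdlib-only sketch of building that payload (the `custom_id` scheme and model name are illustrative):

```python
import json


def build_batch_lines(prompts, model="gpt-4o-mini"):
    """Serialize prompts into the JSONL request format the Batch API expects."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"task-{i}",  # used to match responses back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)


payload = build_batch_lines(["What is distilabel?"])
print(payload)
```

The resulting string would be uploaded as a file and referenced when creating the batch job; responses carry the same `custom_id`, which is what lets a batching step reorder results back into the pipeline.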

## Description TODO ### Ideas * Add `use_cache` flag at `Step` level to avoid caching

improvement

- Went for lancedb because it works in memory. - @frascuchon, as a follow-up we can consider adding Argilla based on your vector search PR :) Do vector search using...
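The comment above motivates lancedb with its in-memory operation. As a generic illustration of what an in-memory vector search does (not lancedb's actual API), a brute-force cosine-similarity lookup:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=1):
    """Return the k entries whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(item["vector"], query), reverse=True)
    return ranked[:k]


# Toy index; real entries would hold embeddings from a model.
index = [
    {"text": "distilabel is awesome!", "vector": [0.9, 0.1, 0.0]},
    {"text": "and Argilla!", "vector": [0.1, 0.9, 0.0]},
]
top = search(index, [1.0, 0.0, 0.0])[0]
print(top["text"])  # distilabel is awesome!
```

A dedicated engine like lancedb replaces the linear scan with an index structure, but the query semantics are the same.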

This PR adds llama-cpp support to create embeddings.
```
from distilabel.embeddings import LlamaCppEmbeddings

embeddings = LlamaCppEmbeddings(model="second-state/all-MiniLM-L6-v2-Q2_K.gguf")
embeddings.load()
results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
# [
#   [-0.05447685346007347, -0.01623094454407692, ...],
...
```

Fix imputing the output when the `output_mapping` is not empty

**Is your feature request related to a problem? Please describe.** When running a TextGeneration task on a big dataset using the OpenAI API, I'm getting the following error: `openai.RateLimitError: Error...
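A common mitigation for `openai.RateLimitError` on large datasets is exponential backoff with retries. A library-agnostic sketch (the exception type, retry count, and delays here are illustrative, and a flaky stand-in replaces the real API call):

```python
import time


def with_backoff(call, retryable=(Exception,), max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))


# Stand-in for an API call that fails twice before succeeding.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = with_backoff(flaky, retryable=(RuntimeError,), base_delay=0.01)
print(result)  # ok
```

In a real pipeline the `retryable` tuple would be `(openai.RateLimitError,)`, and a token- or request-rate limiter in front of the calls avoids hitting the limit in the first place.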