lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
### System Info lorax main ### Information - [X] Docker - [ ] The CLI directly ### Tasks - [ ] An officially supported command - [ ] My own...
### System Info lorax: v0.9.0 awq: main branch transformers: v4.39.3 ### Information - [ ] Docker - [ ] The CLI directly ### Tasks - [ ] An officially supported...
Currently we only support a subset of LoRAX launcher args. We should support all of them as optional overrides: https://github.com/predibase/lorax/blob/main/charts/lorax/templates/deployment.yaml#L35
### Feature request Retrieve all LoRA models from the Hugging Face Hub by base model, e.g. collect all LoRA adapters based on meta-llama/Meta-Llama-3-8B. ### Motivation If I want to take a...
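A minimal sketch of the requested lookup using the `huggingface_hub` API. It assumes adapters on the Hub carry a `base_model:<repo>` tag (as PEFT-pushed adapters typically do) and that the tag format matches; verify both against the adapters you care about before relying on this.

```python
# Sketch: list LoRA adapters on the Hub that declare a given base model.
# Assumption: adapters are tagged "lora" and "base_model:<repo>" on the Hub.
from huggingface_hub import HfApi

def list_lora_adapters(base_model: str, limit: int = 100) -> list[str]:
    api = HfApi()
    models = api.list_models(
        filter=["lora", f"base_model:{base_model}"],
        limit=limit,
    )
    return [m.id for m in models]

if __name__ == "__main__":
    for adapter_id in list_lora_adapters("meta-llama/Meta-Llama-3-8B"):
        print(adapter_id)
```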
### System Info Doesn't work if you make changes to the vocab ### Information - [ ] Docker - [ ] The CLI directly ### Tasks - [ ] An...
### System Info lorax version: `4c39e8a` ### Information - When prompting Mixtral with an adapter, got the following error: `Request failed during generation: Server error: output with shape [1, 32000] doesn't...
We can add back the FA1 implementation from https://github.com/huggingface/text-generation-inference/pull/624 when a compute capability of Volta or Turing is detected. This may bloat the Docker image somewhat to support both, but it seems...
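A rough sketch of the dispatch idea, not LoRAX's actual code: pick FlashAttention v2 on Ampere and newer GPUs, fall back to an FA1-style path on Volta/Turing, and otherwise use a plain attention implementation. The function and return labels here are hypothetical.

```python
# Hypothetical attention-backend selection based on CUDA compute capability.
import torch

def select_attention_impl(device: int = 0) -> str:
    if not torch.cuda.is_available():
        return "eager"
    major, minor = torch.cuda.get_device_capability(device)
    if major >= 8:                            # Ampere, Ada, Hopper
        return "flash_attention_v2"
    if (major, minor) in ((7, 0), (7, 5)):    # Volta, Turing
        return "flash_attention_v1"
    return "eager"                            # older GPUs: no flash attention

print(select_attention_impl())
```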
The concurrency model currently assumes that host-side execution holding the GIL is minimal, but particularly when loading adapters from disk into host memory, we see that large adapters can...
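One hedged illustration of the general pattern for keeping an asyncio event loop responsive during a blocking adapter load: push the load into a worker thread. This is not LoRAX's internals; `load_adapter_weights` is a hypothetical helper, and a thread only helps for the parts of the load that actually release the GIL (e.g. raw file I/O), which is part of why large adapters remain problematic.

```python
# Sketch: offload a blocking adapter load so other requests keep progressing.
import asyncio
from safetensors.torch import load_file

def load_adapter_weights(path: str):
    # Blocking call: reads the adapter tensors from disk into host memory.
    return load_file(path)

async def load_adapter_async(path: str):
    return await asyncio.to_thread(load_adapter_weights, path)

async def main():
    weights = await load_adapter_async("adapter_model.safetensors")
    print(f"loaded {len(weights)} tensors")

if __name__ == "__main__":
    asyncio.run(main())
```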
### Feature request I only see source=local available for the adapters; is this the case? Even with the models cached locally or pointed to local paths, there is still a callout to HF...
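For context, a sketch of the workflow being described: requesting generation against an adapter that already lives on local disk. It assumes the `lorax-client` Python package and that the adapter directory is mounted into the serving container; the parameter names reflect the client as I understand it, so double-check them against the LoRAX docs.

```python
# Sketch: generate with a locally stored adapter via the lorax Python client.
from lorax import Client

client = Client("http://127.0.0.1:8080")
response = client.generate(
    "What is LoRAX?",
    adapter_id="/data/adapters/my-llama3-adapter",  # local path inside the container
    adapter_source="local",
    max_new_tokens=64,
)
print(response.generated_text)
```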