# Unexpected response with long-context model (Phi-3)
### System Info
ghcr.io/predibase/lorax:f1ef0ee
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction

```shell
--port 80 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--cuda-memory-fraction 0.8 \
--sharded false \
--max-waiting-tokens 20 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--hostname 0.0.0.0 \
--max-concurrent-requests 512 \
--max-best-of 1 \
--max-batch-prefill-tokens $BATCH_TOKEN \
--max-active-adapters 10 \
--adapter-source local \
--adapter-cycle-time-s 2 \
--json-output \
--disable-custom-kernels \
--dtype float16
```
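To reproduce, a request along these lines can be sent to the running server (the prompt text and `max_new_tokens` below are placeholders; the real prompt was roughly 1000 tokens long):

```python
import requests

# Placeholder prompt: the actual one was ~1000 tokens and used the
# Phi-3 chat format (<|user|> ... <|end|> <|assistant|>).
prompt = "<|user|>\nSummarize the following document: ...<|end|>\n<|assistant|>\n"

resp = requests.post(
    "http://localhost:80/generate",  # --port 80 from the launcher args above
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512},  # placeholder value
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```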
### Expected behavior
When running LoRAX with the model microsoft/Phi-3-mini-128k-instruct, I encountered unexpected behavior with the following configurations:
Configuration A:
- max-input-length = 4096
- max-total-tokens = 8192
- Prompt Length: Approximately 1000 tokens
In this configuration, the generated response differs significantly from what vLLM produces for the same prompt.
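The vLLM baseline was produced with a setup roughly like this sketch (the prompt, `max_tokens`, and `trust_remote_code` values are assumptions, not the exact ones from my test):

```python
from vllm import LLM, SamplingParams

# Same placeholder prompt as in the LoRAX request above.
prompt = "<|user|>\nSummarize the following document: ...<|end|>\n<|assistant|>\n"

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    dtype="float16",
    trust_remote_code=True,  # assumption: the 128k variant shipped custom LongRoPE code
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)
```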
Configuration B:
- max-input-length = 4090
- max-total-tokens = 4096
This configuration works well and produces the expected results.
Additionally, I tested the model microsoft/Phi-3-mini-4k-instruct, and it also functioned correctly.
It seems there may be an issue with handling long contexts when using microsoft/Phi-3-mini-128k-instruct.
Could you please investigate this issue? I found a related discussion here: [Hugging Face Discussion](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/85). Thank you!