# Unexpected response with long-context model (Phi-3)
### System Info
ghcr.io/predibase/lorax:f1ef0ee
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction

```shell
--port 80 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--cuda-memory-fraction 0.8 \
--sharded false \
--max-waiting-tokens 20 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--hostname 0.0.0.0 \
--max-concurrent-requests 512 \
--max-best-of 1 \
--max-batch-prefill-tokens $BATCH_TOKEN \
--max-active-adapters 10 \
--adapter-source local \
--adapter-cycle-time-s 2 \
--json-output \
--disable-custom-kernels \
--dtype float16
```
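To reproduce, a request along these lines can be sent to the running server (the prompt text and `max_new_tokens` below are placeholders; the real prompt was roughly 1000 tokens long):

```python
import requests

# Placeholder prompt: the actual one was ~1000 tokens and used the
# Phi-3 chat format (<|user|> ... <|end|> <|assistant|>).
prompt = "<|user|>\nSummarize the following document: ...<|end|>\n<|assistant|>\n"

resp = requests.post(
    "http://localhost:80/generate",  # --port 80 from the launcher args above
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512},  # placeholder value
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```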
### Expected behavior
When running LoRAX with the model microsoft/Phi-3-mini-128k-instruct, I encountered unexpected behavior with the following configurations:
Configuration A:
- max-input-length = 4096
- max-total-tokens = 8192
- Prompt Length: Approximately 1000 tokens
In this configuration, the generated response differs significantly from what vLLM produces for the same prompt.
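The vLLM baseline was produced with a setup roughly like this sketch (the prompt, `max_tokens`, and `trust_remote_code` values are assumptions, not the exact ones from my test):

```python
from vllm import LLM, SamplingParams

# Same placeholder prompt as in the LoRAX request above.
prompt = "<|user|>\nSummarize the following document: ...<|end|>\n<|assistant|>\n"

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    dtype="float16",
    trust_remote_code=True,  # assumption: the 128k variant shipped custom LongRoPE code
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)
```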
Configuration B:
- max-input-length = 4090
- max-total-tokens = 4096
This configuration works well and produces the expected results.
Additionally, I tested the model microsoft/Phi-3-mini-4k-instruct, and it also functioned correctly.
It seems there may be an issue with handling long contexts when using microsoft/Phi-3-mini-128k-instruct.
Could you please investigate this issue? I found a related discussion here: [Hugging Face Discussion](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/85). Thank you!