text-generation-inference
falcon-7b-instruct model generates unexpected text without flash attention
System Info
Version: ghcr.io/huggingface/text-generation-inference:latest
OS: Ubuntu 22.04 LTS
GPU: 1 x A100 80GB GPU on Azure
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
sudo docker run --gpus all -p 8080:80 -v /mnt/ext/data:/data -e USE_FLASH_ATTENTION=FALSE ghcr.io/huggingface/text-generation-inference:latest --model-id tiiuae/falcon-7b-instruct --trust-remote-code
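To observe the unexpected output, query the running container's /generate endpoint. A minimal sketch of the request body (the prompt and generation parameters are illustrative assumptions; any prompt shows the behaviour):

```python
import json

# Example request body for TGI's /generate endpoint on the container
# started above. Prompt and parameters are illustrative.
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 50, "do_sample": False},
}

body = json.dumps(payload)
print(body)

# Send it to the server, e.g.:
#   curl http://localhost:8080/generate -X POST \
#        -H "Content-Type: application/json" \
#        -d "$BODY"
```

Running the same request against the USE_FLASH_ATTENTION=FALSE container is what produces the garbled generation reported here.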
Expected behavior
sudo docker run --gpus all -p 8080:80 -v /mnt/ext/data:/data -e USE_FLASH_ATTENTION=TRUE ghcr.io/huggingface/text-generation-inference:latest --model-id tiiuae/falcon-7b-instruct --trust-remote-code
With flash attention switched on (as above), generation is correct. The same correct output is expected when flash attention is disabled, but instead the model produces unexpected text.