text-generation-inference
falcon-7b-instruct model generates unexpected text without flash attention
System Info
Version: ghcr.io/huggingface/text-generation-inference:latest
OS: Ubuntu 22.04 LTS
GPU: 1 x A100 80GB GPU on Azure
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
sudo docker run --gpus all -p 8080:80 -v /mnt/ext/data:/data -e USE_FLASH_ATTENTION=FALSE ghcr.io/huggingface/text-generation-inference:latest --model-id tiiuae/falcon-7b-instruct --trust-remote-code
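To observe the unexpected output, query the running container's /generate endpoint. A minimal sketch of the request body (the prompt and generation parameters are illustrative assumptions; any prompt shows the behaviour):

```python
import json

# Example request body for TGI's /generate endpoint on the container
# started above. Prompt and parameters are illustrative.
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 50, "do_sample": False},
}

body = json.dumps(payload)
print(body)

# Send it to the server, e.g.:
#   curl http://localhost:8080/generate -X POST \
#        -H "Content-Type: application/json" \
#        -d "$BODY"
```

Running the same request against the USE_FLASH_ATTENTION=FALSE container is what produces the garbled generation reported here.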
Expected behavior
sudo docker run --gpus all -p 8080:80 -v /mnt/ext/data:/data -e USE_FLASH_ATTENTION=TRUE ghcr.io/huggingface/text-generation-inference:latest --model-id tiiuae/falcon-7b-instruct --trust-remote-code
With flash attention switched on (as above), generation is correct. The same correct output is expected when flash attention is disabled, but instead the model produces unexpected text.