text-generation-webui
Something seems wrong with performance on Nvidia/cuda
Describe the bug
I have a 13th-gen i9, 64 GB of DDR5 RAM, and an otherwise idle RTX 3090, with a fresh Anaconda install. Running LLaMA 7B in 8-bit mode gives me 4-7 tokens per second, and the GPU stays below 1% average utilization in Task Manager.
That is half the speed the same model reaches on the CPU using the C++ implementation (llama.cpp). Given the CPU speed, I would expect 100-500 tokens/s on a 3090.
So something is very off from the expected speed.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --model llama-7b --load-in-8bit
Screenshot
No response
Logs
pip list
Package Version
----------------------------- --------------------
accelerate 0.17.1
aiofiles 23.1.0
aiohttp 3.8.1
aiosignal 1.2.0
alabaster 0.7.12
altair 4.2.2
anaconda-client 1.11.0
anaconda-navigator 2.3.2
anaconda-project 0.10.2
anyio 3.5.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
astroid 2.6.6
astropy 5.0.4
asttokens 2.0.5
async-timeout 4.0.1
atomicwrites 1.4.0
attrs 22.1.0
Automat 20.2.0
autopep8 1.6.0
Babel 2.11.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 3.2.0
beautifulsoup4 4.11.1
binaryornot 0.4.4
bitarray 2.5.1
bitsandbytes 0.37.1
bkcharts 0.2
black 22.6.0
bleach 4.1.0
bokeh 2.4.2
boto3 1.24.28
botocore 1.27.59
Bottleneck 1.3.4
brotlipy 0.7.0
cachetools 4.2.2
certifi 2022.12.7
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.4
clip-anytorch 2.5.0
cloudpickle 2.0.0
clyent 1.2.2
colorama 0.4.6
colorcet 3.0.1
coloredlogs 15.0.1
comm 0.1.2
comtypes 1.1.10
conda 23.1.0
conda-build 3.23.3
conda-content-trust 0.1.3
conda-pack 0.6.0
conda-package-handling 2.0.2
conda_package_streaming 0.7.0
conda-repo-cli 1.0.27
conda-token 0.4.0
conda-verify 3.4.2
constantly 15.1.0
cookiecutter 1.7.3
cryptography 38.0.4
cssselect 1.1.0
cycler 0.11.0
Cython 0.29.33
cytoolz 0.12.0
daal4py 2021.5.0
dask 2022.2.1
datashader 0.13.0
datashape 0.5.4
debugpy 1.5.1
decorator 5.1.1
defusedxml 0.7.1
diff-match-patch 20200713
distributed 2022.2.1
docutils 0.18.1
entrypoints 0.4
et-xmlfile 1.1.0
executing 0.8.3
fairscale 0.4.4
fastapi 0.93.0
fastjsonschema 2.16.2
ffmpy 0.3.0
filelock 3.9.0
fire 0.4.0
flake8 3.9.2
Flask 2.2.2
flatbuffers 2.0.7
flexgen 0.1.7
flit_core 3.6.0
fonttools 4.25.0
frozenlist 1.2.0
fsspec 2022.11.0
ftfy 6.1.1
future 0.18.2
gensim 4.1.2
gitdb 4.0.7
GitPython 3.1.30
glob2 0.7
google-api-core 1.25.1
google-auth 1.33.0
google-cloud-core 1.7.1
google-cloud-storage 1.31.0
google-crc32c 1.1.2
google-resumable-media 1.3.1
googleapis-common-protos 1.53.0
gptj 3.0.9
gradio 3.18.0
greenlet 2.0.1
grpcio 1.42.0
h11 0.14.0
h5py 3.6.0
HeapDict 1.0.1
holoviews 1.14.8
httpcore 0.16.3
httpx 0.23.3
huggingface-hub 0.12.1
humanfriendly 10.0
hvplot 0.7.3
hyperlink 21.0.0
idna 3.4
imagecodecs 2021.8.26
imageio 2.9.0
imagesize 1.4.1
importlib-metadata 4.11.3
incremental 21.3.0
inflection 0.5.1
iniconfig 1.1.1
intake 0.6.5
intervaltree 3.1.0
invisible-watermark 0.1.5
ipykernel 6.19.2
ipython 8.10.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
isort 5.9.3
itemadapter 0.3.0
itemloaders 1.0.4
itsdangerous 2.0.1
jdcal 1.4.1
jedi 0.18.1
Jinja2 3.1.2
jinja2-time 0.2.0
jmespath 0.10.0
joblib 1.1.1
json5 0.9.6
jsonschema 4.17.3
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter_core 5.2.0
jupyter-server 1.23.4
jupyterlab 3.5.3
jupyterlab-pygments 0.1.2
jupyterlab_server 2.19.0
jupyterlab-widgets 1.0.0
keyring 23.4.0
kiwisolver 1.4.4
lazy-object-proxy 1.6.0
libarchive-c 2.9
linkify-it-py 2.0.0
llvmlite 0.38.0
locket 1.0.0
lxml 4.9.1
Markdown 3.4.1
markdown-it-py 2.2.0
MarkupSafe 2.1.1
matplotlib 3.5.1
matplotlib-inline 0.1.6
mccabe 0.6.1
mdit-py-plugins 0.3.5
mdurl 0.1.2
menuinst 1.4.19
mistune 0.8.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mock 4.0.3
mpmath 1.2.1
msgpack 1.0.3
multidict 5.1.0
multipledispatch 0.6.0
munkres 1.1.4
mypy-extensions 0.4.3
navigator-updater 0.3.0
nbclassic 0.5.2
nbclient 0.5.13
nbconvert 6.5.4
nbformat 5.7.0
nest-asyncio 1.5.6
networkx 2.8.4
nltk 3.7
nose 1.3.7
notebook 6.5.2
notebook_shim 0.2.2
numba 0.55.1
numexpr 2.8.1
numpy 1.21.5
numpydoc 1.5.0
olefile 0.46
onnx 1.12.0
onnxruntime 1.12.1
opencv-python 4.6.0.66
openpyxl 3.0.10
orjson 3.8.7
packaging 22.0
pandas 1.5.2
pandocfilters 1.5.0
panel 0.13.0
param 1.12.3
paramiko 2.8.1
parsel 1.6.0
parso 0.8.3
partd 1.2.0
pathlib 1.0.1
pathspec 0.10.3
patsy 0.5.2
peft 0.2.0
pep8 1.7.1
pexpect 4.8.0
picklescan 0.0.8
pickleshare 0.7.5
Pillow 9.3.0
pip 22.3.1
pkginfo 1.8.3
platformdirs 2.5.2
plotly 5.9.0
pluggy 1.0.0
poyo 0.5.0
prometheus-client 0.14.1
prompt-toolkit 3.0.36
Protego 0.1.16
protobuf 3.19.1
psutil 5.9.0
ptyprocess 0.7.0
PuLP 2.7.0
pure-eval 0.2.2
py 1.11.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycocoevalcap 1.2
pycocotools 2.0.6
pycodestyle 2.7.0
pycosat 0.6.4
pycparser 2.21
pycryptodome 3.17
pyct 0.5.0
pycurl 7.45.1
pydantic 1.10.2
PyDispatcher 2.0.5
pydocstyle 6.3.0
pydub 0.25.1
pyerfa 2.0.0
pyflakes 2.3.1
Pygments 2.11.2
PyHamcrest 2.0.2
PyJWT 2.4.0
pylint 2.9.6
pyls-spyder 0.4.0
PyNaCl 1.5.0
pyodbc 4.0.34
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyreadline 2.1
pyreadline3 3.4.1
pyrsistent 0.18.0
PySocks 1.7.1
pytest 7.1.2
python-dateutil 2.8.2
python-lsp-black 1.0.0
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.2.4
python-multipart 0.0.6
python-slugify 5.0.2
python-snappy 0.6.1
pytoolconfig 1.2.5
pytz 2022.7
pyviz-comms 2.0.2
PyWavelets 1.3.0
pywin32 305.1
pywin32-ctypes 0.2.0
pywinpty 2.0.2
PyYAML 6.0
pyzmq 23.2.0
QDarkStyle 3.0.2
qstylizer 0.2.2
QtAwesome 1.2.2
qtconsole 5.4.0
QtPy 2.2.0
queuelib 1.5.0
regex 2022.7.9
requests 2.28.1
requests-file 1.5.1
rfc3986 1.5.0
rope 1.7.0
rsa 4.7.2
Rtree 1.0.1
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.17.21
rwkv 0.4.2
s3transfer 0.6.0
sacremoses 0.0.43
safetensors 0.3.0
scikit-image 0.19.2
scikit-learn 1.0.2
scikit-learn-intelex 2021.20220215.102710
scipy 1.7.3
Scrapy 2.6.2
seaborn 0.11.2
Send2Trash 1.8.0
sentencepiece 0.1.97
service-identity 18.1.0
setuptools 65.6.3
sip 4.19.13
six 1.16.0
smart-open 5.2.1
smmap 4.0.0
sniffio 1.2.0
snowballstemmer 2.2.0
sortedcollections 2.1.0
sortedcontainers 2.4.0
soupsieve 2.3.2.post1
Sphinx 5.0.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
spyder 5.1.5
spyder-kernels 2.1.3
SQLAlchemy 1.4.39
stack-data 0.2.0
starlette 0.25.0
statsmodels 0.13.2
sympy 1.11.1
tables 3.6.1
tabulate 0.8.10
TBB 0.2
tblib 1.7.0
tenacity 8.0.1
termcolor 2.0.1
terminado 0.17.1
testpath 0.6.0
text-unidecode 1.3
textdistance 4.2.1
threadpoolctl 2.2.0
three-merge 0.1.1
tifffile 2021.7.2
timm 0.4.12
tinycss 0.4
tinycss2 1.2.1
tldextract 3.2.0
tokenizers 0.13.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 1.13.1
torchaudio 0.13.1
torchsummary 1.5.1
torchvision 0.14.1
tornado 6.2
tqdm 4.64.1
traitlets 5.7.1
transformers 4.27.0.dev0
Twisted 22.2.0
twisted-iocpsupport 1.0.2
typed-ast 1.4.3
typing_extensions 4.4.0
uc-micro-py 1.0.1
ujson 5.4.0
Unidecode 1.2.0
urllib3 1.26.14
uvicorn 0.21.0
var-dump 1.2
w3lib 1.21.0
watchdog 2.1.6
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
websockets 10.4
Werkzeug 2.2.2
wheel 0.38.4
widgetsnbextension 3.5.2
win-inet-pton 1.1.0
win-unicode-console 0.5
wincertstore 0.2
wrapt 1.12.1
xarray 0.20.1
xlrd 2.0.1
XlsxWriter 3.0.3
xlwings 0.29.1
yapf 0.31.0
yarl 1.6.3
zict 2.1.0
zipp 3.11.0
zope.interface 5.4.0
zstandard 0.19.0
System Info
Win 11
64 gig DDR5 - 5600
RTX 3090
8-bit on GPU via bitsandbytes is known to be slower than fp16. On a 3090 you should be able to fit the full fp16 version of the model, so there isn't really a reason to run in 8-bit. If you want to minimize quality loss, LLaMA 13B at 4-bit uses roughly the same amount of memory as 7B in 8-bit and has better benchmark scores.
You're unlikely to get anywhere near 100 t/s because you are bottlenecked by GPU memory bandwidth. That's the reason CPUs are able to compete at all.
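For reference, dropping --load-in-8bit from the reproduction command should load the model in fp16 instead. Underneath, the difference roughly boils down to the following sketch (an illustration only, assuming a transformers build with LLaMA support and a hypothetical local weights path; this is not the exact code path server.py uses):

```python
# Sketch: loading LLaMA-7B in fp16 vs. 8-bit with transformers + bitsandbytes.
# "./models/llama-7b" is a hypothetical local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# fp16: ~13-14 GB of weights, fits on a 24 GB RTX 3090 and avoids the
# slower bitsandbytes int8 matmul path.
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit alternative: roughly half the VRAM (~7-8 GB) but typically slower per token.
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, load_in_8bit=True, device_map="auto"
# )
```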
Appreciate your response, but that is definitely not the reason. When memory bandwidth is actually being used on the 3090, the card draws 200+ W of sustained power and heats up to 90 °C within seconds (even with the tensor cores idle). This repository's implementation does not heat up the card at all; it stays completely idle during the entire process. Compare that to something like Stable Diffusion, which can heat a large room in winter, and it's far from being fully optimized.
The memory bandwidth of the 3090 is more than 10 times higher than that of my DDR5 system, and the 4090's would be even higher. The raw FP processing speed is thousands of times higher.
If 10+ tokens/s is possible on the CPU, I would expect 100 t/s as a minimum just from the memory bandwidth advantage; given that the bottleneck is so much wider and the processing speed isn't even comparable, the total speed should be much higher than that.
Something is not right in the GPU implementation. As long as the GPU stays cold, it's not being used.
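To make the bandwidth argument concrete, here is a rough back-of-the-envelope calculation, assuming generation is purely memory-bandwidth bound (every new token streams all of the weights once) and using approximate spec numbers:

```python
# Back-of-the-envelope decode-speed ceiling, assuming token generation is
# limited only by how fast the weights can be read from memory.
def tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on tokens/s if weight reads were the only cost."""
    return bandwidth_gb_s / weight_gb

weights_8bit_7b = 7.0    # ~7 GB of weights for LLaMA-7B in int8
rtx3090_bw = 936.0       # GB/s, RTX 3090 spec
ddr5_5600_bw = 89.6      # GB/s, dual-channel DDR5-5600 (theoretical)

print(f"3090 ceiling : {tokens_per_second(rtx3090_bw, weights_8bit_7b):.0f} t/s")    # ~134
print(f"DDR5 ceiling : {tokens_per_second(ddr5_5600_bw, weights_8bit_7b):.0f} t/s")  # ~13
```

Real throughput lands well below such a ceiling because of kernel overhead (the bitsandbytes int8 path in particular), but it frames why well over 10 t/s would be expected on a 3090 and why 4-7 t/s with an essentially idle GPU looks suspicious.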
I noticed in your pip list there are no NVIDIA packages. Try "conda install -c conda-forge cudatoolkit-dev". Also, what do you see when you run "nvidia-smi -lms"?
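Along the same lines, a quick way to check whether the installed torch is actually a CUDA build (a CPU-only wheel would silently run everything on the CPU, which would match the cold GPU described above) is a sketch like:

```python
# Sanity check (not part of the webui): is this torch a CUDA build, and can it see the 3090?
import torch

print(torch.__version__)       # pip CUDA wheels usually carry a +cuXXX suffix
print(torch.version.cuda)      # None on a CPU-only build
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RTX 3090
```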
I agree with you. The RTX 3090 has 24 GB of VRAM, so it should have no problem with the 8-bit 7B model (~9.2 GB). You should even be able to handle 8-bit 13B or 4-bit 30B without spilling over into slower system RAM.
For more information, could you please help us compare the performance of different setups on your GPU and CPU? For example:
- 4-bit 7B model on the i9 CPU, [text-generation-webui]
- 4-bit 7B model on the 3090 GPU, [text-generation-webui]
- 4-bit 7B model on the i9 CPU, [llama.cpp]
And, for reference only, to show that your CUDA install and driver work normally:
- Stable Diffusion on the i9 CPU
- Stable Diffusion on the 3090 GPU
Do the same for the 8-bit 7B model and larger ones, and record CPU, GPU (CUDA, not 3D), RAM and VRAM usage in each case; a simple way to record tokens/s consistently is sketched below. I think this data could be very useful for anyone who wants to study this performance problem.
You can also compare your results with this test: tomshardware.com/news/running-your-own-chatbot-on-a-single-gpu, in which a 3090 reaches 20.8 tokens/s on a 4-bit 13B model.
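For recording tokens/s across the cases above, a minimal timing sketch at the transformers level could look like this (hypothetical model path and settings; the webui and llama.cpp also report their own timings, which are fine to use instead):

```python
# Minimal timing sketch for the comparison table above (hypothetical path/settings).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/llama-7b"   # assumption: local weights
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("test", return_tensors="pt").to(model.device)

start = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")
```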
Do you all have the GPU-accelerated bitsandbytes build on Windows, and a proper CUDA build of torch, etc.? Because it sounds like you don't.
The performance on my RTX 4090 GPU is really slow too. It should be a lot faster.
It definitely feels slower on the Nerys OPT model compared to Kobold running on the same hardware.
On Kobold, with the input "test" it generates 200 tokens in about 9 seconds.
Oobabooga takes at least 13 seconds (in Kobold API emulation) and up to 20 if I try to match the parameters manually.
And that's with a small context. With a big one (~1K tokens) the difference is even bigger, to the point where I thought I was running in CPU mode.
I seem to also have an issue with speed.
Regarding "GPU accelerated bits and bytes on windows, right?": I have the DLL for cuda116, and Task Manager shows CUDA is being used (24% for the 7B model).
But I'm getting generation speeds of 4-9 tokens/second. I've seen some posts on Reddit comparing speeds, and my GPU should be getting around 24 tokens/s (with the 13B model).
I don't know what to check. Maybe compiling bitsandbytes myself? (But after seeing that CUDA is being used, I feel that might not be the issue.) Edit: Also, I'm using 4-bit models.
Second this. I am on the latest update of all packages in the webui, and TheBloke_Llama-2-13B-GPTQ in ExLlama on a 4090 is only getting me 4.19-5.21 tokens/second. I see the model load into VRAM, but my GPU utilization when generating is only 8%.
So... I just reinstalled using the auto-installer and I'm constantly getting this error, even when trying to load using AutoGPT:
Exllama kernel is not installed
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.