text-generation-webui
Something seems wrong with performance on Nvidia/cuda
Describe the bug
I have a 13th-gen i9, 64 GB of DDR5 RAM, and an otherwise idle RTX 3090, with a fresh Anaconda install. Running LLaMA 7B in 8-bit mode gives me 4-7 tokens per second, and the GPU stays below 1% average utilization in Task Manager.
That is half the speed the same model reaches on the CPU using the C++ implementation (llama.cpp). Given the CPU speed, I would expect 100-500 tokens/s on a 3090.
So something is very off from the expected speed.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --model llama-7b --load-in-8bit
Screenshot
No response
Logs
pip list
Package Version
----------------------------- --------------------
accelerate 0.17.1
aiofiles 23.1.0
aiohttp 3.8.1
aiosignal 1.2.0
alabaster 0.7.12
altair 4.2.2
anaconda-client 1.11.0
anaconda-navigator 2.3.2
anaconda-project 0.10.2
anyio 3.5.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
astroid 2.6.6
astropy 5.0.4
asttokens 2.0.5
async-timeout 4.0.1
atomicwrites 1.4.0
attrs 22.1.0
Automat 20.2.0
autopep8 1.6.0
Babel 2.11.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 3.2.0
beautifulsoup4 4.11.1
binaryornot 0.4.4
bitarray 2.5.1
bitsandbytes 0.37.1
bkcharts 0.2
black 22.6.0
bleach 4.1.0
bokeh 2.4.2
boto3 1.24.28
botocore 1.27.59
Bottleneck 1.3.4
brotlipy 0.7.0
cachetools 4.2.2
certifi 2022.12.7
cffi 1.15.1
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.4
clip-anytorch 2.5.0
cloudpickle 2.0.0
clyent 1.2.2
colorama 0.4.6
colorcet 3.0.1
coloredlogs 15.0.1
comm 0.1.2
comtypes 1.1.10
conda 23.1.0
conda-build 3.23.3
conda-content-trust 0.1.3
conda-pack 0.6.0
conda-package-handling 2.0.2
conda_package_streaming 0.7.0
conda-repo-cli 1.0.27
conda-token 0.4.0
conda-verify 3.4.2
constantly 15.1.0
cookiecutter 1.7.3
cryptography 38.0.4
cssselect 1.1.0
cycler 0.11.0
Cython 0.29.33
cytoolz 0.12.0
daal4py 2021.5.0
dask 2022.2.1
datashader 0.13.0
datashape 0.5.4
debugpy 1.5.1
decorator 5.1.1
defusedxml 0.7.1
diff-match-patch 20200713
distributed 2022.2.1
docutils 0.18.1
entrypoints 0.4
et-xmlfile 1.1.0
executing 0.8.3
fairscale 0.4.4
fastapi 0.93.0
fastjsonschema 2.16.2
ffmpy 0.3.0
filelock 3.9.0
fire 0.4.0
flake8 3.9.2
Flask 2.2.2
flatbuffers 2.0.7
flexgen 0.1.7
flit_core 3.6.0
fonttools 4.25.0
frozenlist 1.2.0
fsspec 2022.11.0
ftfy 6.1.1
future 0.18.2
gensim 4.1.2
gitdb 4.0.7
GitPython 3.1.30
glob2 0.7
google-api-core 1.25.1
google-auth 1.33.0
google-cloud-core 1.7.1
google-cloud-storage 1.31.0
google-crc32c 1.1.2
google-resumable-media 1.3.1
googleapis-common-protos 1.53.0
gptj 3.0.9
gradio 3.18.0
greenlet 2.0.1
grpcio 1.42.0
h11 0.14.0
h5py 3.6.0
HeapDict 1.0.1
holoviews 1.14.8
httpcore 0.16.3
httpx 0.23.3
huggingface-hub 0.12.1
humanfriendly 10.0
hvplot 0.7.3
hyperlink 21.0.0
idna 3.4
imagecodecs 2021.8.26
imageio 2.9.0
imagesize 1.4.1
importlib-metadata 4.11.3
incremental 21.3.0
inflection 0.5.1
iniconfig 1.1.1
intake 0.6.5
intervaltree 3.1.0
invisible-watermark 0.1.5
ipykernel 6.19.2
ipython 8.10.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
isort 5.9.3
itemadapter 0.3.0
itemloaders 1.0.4
itsdangerous 2.0.1
jdcal 1.4.1
jedi 0.18.1
Jinja2 3.1.2
jinja2-time 0.2.0
jmespath 0.10.0
joblib 1.1.1
json5 0.9.6
jsonschema 4.17.3
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter_core 5.2.0
jupyter-server 1.23.4
jupyterlab 3.5.3
jupyterlab-pygments 0.1.2
jupyterlab_server 2.19.0
jupyterlab-widgets 1.0.0
keyring 23.4.0
kiwisolver 1.4.4
lazy-object-proxy 1.6.0
libarchive-c 2.9
linkify-it-py 2.0.0
llvmlite 0.38.0
locket 1.0.0
lxml 4.9.1
Markdown 3.4.1
markdown-it-py 2.2.0
MarkupSafe 2.1.1
matplotlib 3.5.1
matplotlib-inline 0.1.6
mccabe 0.6.1
mdit-py-plugins 0.3.5
mdurl 0.1.2
menuinst 1.4.19
mistune 0.8.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mock 4.0.3
mpmath 1.2.1
msgpack 1.0.3
multidict 5.1.0
multipledispatch 0.6.0
munkres 1.1.4
mypy-extensions 0.4.3
navigator-updater 0.3.0
nbclassic 0.5.2
nbclient 0.5.13
nbconvert 6.5.4
nbformat 5.7.0
nest-asyncio 1.5.6
networkx 2.8.4
nltk 3.7
nose 1.3.7
notebook 6.5.2
notebook_shim 0.2.2
numba 0.55.1
numexpr 2.8.1
numpy 1.21.5
numpydoc 1.5.0
olefile 0.46
onnx 1.12.0
onnxruntime 1.12.1
opencv-python 4.6.0.66
openpyxl 3.0.10
orjson 3.8.7
packaging 22.0
pandas 1.5.2
pandocfilters 1.5.0
panel 0.13.0
param 1.12.3
paramiko 2.8.1
parsel 1.6.0
parso 0.8.3
partd 1.2.0
pathlib 1.0.1
pathspec 0.10.3
patsy 0.5.2
peft 0.2.0
pep8 1.7.1
pexpect 4.8.0
picklescan 0.0.8
pickleshare 0.7.5
Pillow 9.3.0
pip 22.3.1
pkginfo 1.8.3
platformdirs 2.5.2
plotly 5.9.0
pluggy 1.0.0
poyo 0.5.0
prometheus-client 0.14.1
prompt-toolkit 3.0.36
Protego 0.1.16
protobuf 3.19.1
psutil 5.9.0
ptyprocess 0.7.0
PuLP 2.7.0
pure-eval 0.2.2
py 1.11.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycocoevalcap 1.2
pycocotools 2.0.6
pycodestyle 2.7.0
pycosat 0.6.4
pycparser 2.21
pycryptodome 3.17
pyct 0.5.0
pycurl 7.45.1
pydantic 1.10.2
PyDispatcher 2.0.5
pydocstyle 6.3.0
pydub 0.25.1
pyerfa 2.0.0
pyflakes 2.3.1
Pygments 2.11.2
PyHamcrest 2.0.2
PyJWT 2.4.0
pylint 2.9.6
pyls-spyder 0.4.0
PyNaCl 1.5.0
pyodbc 4.0.34
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyreadline 2.1
pyreadline3 3.4.1
pyrsistent 0.18.0
PySocks 1.7.1
pytest 7.1.2
python-dateutil 2.8.2
python-lsp-black 1.0.0
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.2.4
python-multipart 0.0.6
python-slugify 5.0.2
python-snappy 0.6.1
pytoolconfig 1.2.5
pytz 2022.7
pyviz-comms 2.0.2
PyWavelets 1.3.0
pywin32 305.1
pywin32-ctypes 0.2.0
pywinpty 2.0.2
PyYAML 6.0
pyzmq 23.2.0
QDarkStyle 3.0.2
qstylizer 0.2.2
QtAwesome 1.2.2
qtconsole 5.4.0
QtPy 2.2.0
queuelib 1.5.0
regex 2022.7.9
requests 2.28.1
requests-file 1.5.1
rfc3986 1.5.0
rope 1.7.0
rsa 4.7.2
Rtree 1.0.1
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.17.21
rwkv 0.4.2
s3transfer 0.6.0
sacremoses 0.0.43
safetensors 0.3.0
scikit-image 0.19.2
scikit-learn 1.0.2
scikit-learn-intelex 2021.20220215.102710
scipy 1.7.3
Scrapy 2.6.2
seaborn 0.11.2
Send2Trash 1.8.0
sentencepiece 0.1.97
service-identity 18.1.0
setuptools 65.6.3
sip 4.19.13
six 1.16.0
smart-open 5.2.1
smmap 4.0.0
sniffio 1.2.0
snowballstemmer 2.2.0
sortedcollections 2.1.0
sortedcontainers 2.4.0
soupsieve 2.3.2.post1
Sphinx 5.0.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
spyder 5.1.5
spyder-kernels 2.1.3
SQLAlchemy 1.4.39
stack-data 0.2.0
starlette 0.25.0
statsmodels 0.13.2
sympy 1.11.1
tables 3.6.1
tabulate 0.8.10
TBB 0.2
tblib 1.7.0
tenacity 8.0.1
termcolor 2.0.1
terminado 0.17.1
testpath 0.6.0
text-unidecode 1.3
textdistance 4.2.1
threadpoolctl 2.2.0
three-merge 0.1.1
tifffile 2021.7.2
timm 0.4.12
tinycss 0.4
tinycss2 1.2.1
tldextract 3.2.0
tokenizers 0.13.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 1.13.1
torchaudio 0.13.1
torchsummary 1.5.1
torchvision 0.14.1
tornado 6.2
tqdm 4.64.1
traitlets 5.7.1
transformers 4.27.0.dev0
Twisted 22.2.0
twisted-iocpsupport 1.0.2
typed-ast 1.4.3
typing_extensions 4.4.0
uc-micro-py 1.0.1
ujson 5.4.0
Unidecode 1.2.0
urllib3 1.26.14
uvicorn 0.21.0
var-dump 1.2
w3lib 1.21.0
watchdog 2.1.6
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
websockets 10.4
Werkzeug 2.2.2
wheel 0.38.4
widgetsnbextension 3.5.2
win-inet-pton 1.1.0
win-unicode-console 0.5
wincertstore 0.2
wrapt 1.12.1
xarray 0.20.1
xlrd 2.0.1
XlsxWriter 3.0.3
xlwings 0.29.1
yapf 0.31.0
yarl 1.6.3
zict 2.1.0
zipp 3.11.0
zope.interface 5.4.0
zstandard 0.19.0
System Info
Win 11
64 gig DDR5 - 5600
RTX 3090
8-bit on GPU via bitsandbytes is known to be slower than fp16. On a 3090 you should be able to fit the full fp16 version of the model, so there isn't really a reason to run in 8-bit. If you want to minimize quality loss, LLaMA 13B at 4-bit uses roughly the same amount of memory as 7B in 8-bit and has better benchmark scores.
You're unlikely to get anywhere near 100 t/s because you are bottlenecked by GPU memory bandwidth. That's the reason CPUs are able to compete at all.
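For reference, dropping --load-in-8bit from the reproduction command should load the model in fp16 instead. Underneath, the difference roughly boils down to the following sketch (an illustration only, assuming a transformers build with LLaMA support and a hypothetical local weights path; this is not the exact code path server.py uses):

```python
# Sketch: loading LLaMA-7B in fp16 vs. 8-bit with transformers + bitsandbytes.
# "./models/llama-7b" is a hypothetical local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# fp16: ~13-14 GB of weights, fits on a 24 GB RTX 3090 and avoids the
# slower bitsandbytes int8 matmul path.
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit alternative: roughly half the VRAM (~7-8 GB) but typically slower per token.
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, load_in_8bit=True, device_map="auto"
# )
```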
Appreciate your response, but that is definitely not the reason. When memory bandwidth is actually being used on the 3090, the card draws 200+ W of sustained power and heats up to 90 °C within seconds (even with the tensor cores idle). This repository's implementation does not heat up the card at all; it stays completely idle during the entire process. Compare that to something like Stable Diffusion, which can heat a large room in winter, and it's far from being fully optimized.
The memory bandwidth of the 3090 is more than 10 times higher than that of my DDR5 system, and the 4090's would be even higher. The raw FP processing speed is thousands of times higher.
If 10+ tokens/s is possible on the CPU, I would expect 100 t/s as a minimum just from the memory bandwidth advantage; given that the bottleneck is so much wider and the processing speed isn't even comparable, the total speed should be much higher than that.
Something is not right in the GPU implementation. As long as the GPU stays cold, it's not being used.
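To make the bandwidth argument concrete, here is a rough back-of-the-envelope calculation, assuming generation is purely memory-bandwidth bound (every new token streams all of the weights once) and using approximate spec numbers:

```python
# Back-of-the-envelope decode-speed ceiling, assuming token generation is
# limited only by how fast the weights can be read from memory.
def tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on tokens/s if weight reads were the only cost."""
    return bandwidth_gb_s / weight_gb

weights_8bit_7b = 7.0    # ~7 GB of weights for LLaMA-7B in int8
rtx3090_bw = 936.0       # GB/s, RTX 3090 spec
ddr5_5600_bw = 89.6      # GB/s, dual-channel DDR5-5600 (theoretical)

print(f"3090 ceiling : {tokens_per_second(rtx3090_bw, weights_8bit_7b):.0f} t/s")    # ~134
print(f"DDR5 ceiling : {tokens_per_second(ddr5_5600_bw, weights_8bit_7b):.0f} t/s")  # ~13
```

Real throughput lands well below such a ceiling because of kernel overhead (the bitsandbytes int8 path in particular), but it frames why well over 10 t/s would be expected on a 3090 and why 4-7 t/s with an essentially idle GPU looks suspicious.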
I noticed in your pip list there are no NVIDIA packages. Try "conda install -c conda-forge cudatoolkit-dev". Also, what do you see when you run "nvidia-smi -lms"?
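Along the same lines, a quick way to check whether the installed torch is actually a CUDA build (a CPU-only wheel would silently run everything on the CPU, which would match the cold GPU described above) is a sketch like:

```python
# Sanity check (not part of the webui): is this torch a CUDA build, and can it see the 3090?
import torch

print(torch.__version__)       # pip CUDA wheels usually carry a +cuXXX suffix
print(torch.version.cuda)      # None on a CPU-only build
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RTX 3090
```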
I agree with you. The RTX 3090 has 24 GB of VRAM, so it should have no problem with the 8-bit 7B model (~9.2 GB). You should even be able to handle 8-bit 13B or 4-bit 30B without spilling over into slower system RAM.
For more information, could you please help us compare the performance of different setups on your GPU and CPU? For example:
- 4-bit 7B model on the i9 CPU, [text-generation-webui]
- 4-bit 7B model on the 3090 GPU, [text-generation-webui]
- 4-bit 7B model on the i9 CPU, [llama.cpp]
And, for reference only, to show that your CUDA install and driver work normally:
- Stable Diffusion on the i9 CPU
- Stable Diffusion on the 3090 GPU
Do the same for the 8-bit 7B model and larger ones, and record CPU, GPU (CUDA, not 3D), RAM and VRAM usage in each case; a simple way to record tokens/s consistently is sketched below. I think this data could be very useful for anyone who wants to study this performance problem.
You can also compare your results with this test: tomshardware.com/news/running-your-own-chatbot-on-a-single-gpu, in which a 3090 reaches 20.8 tokens/s on a 4-bit 13B model.
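For recording tokens/s across the cases above, a minimal timing sketch at the transformers level could look like this (hypothetical model path and settings; the webui and llama.cpp also report their own timings, which are fine to use instead):

```python
# Minimal timing sketch for the comparison table above (hypothetical path/settings).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./models/llama-7b"   # assumption: local weights
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("test", return_tensors="pt").to(model.device)

start = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")
```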
Do you all have the GPU-accelerated bitsandbytes build on Windows, and a proper CUDA build of torch, etc.? Because it sounds like you don't.
The performance on my RTX 4090 GPU is really slow too. It should be a lot faster.
It definitely feels slower on the Nerys OPT model compared to Kobold running on the same hardware.
On Kobold, with the input "test" it generates 200 tokens in about 9 seconds.
Oobabooga takes at least 13 seconds (in Kobold API emulation) and up to 20 if I try to match the parameters manually.
And that's with a small context. With a big one (~1K tokens) the difference is even bigger, to the point where I thought I was running in CPU mode.
I seem to also have an issue with speed.
Regarding "GPU accelerated bits and bytes on windows, right?": I have the DLL for cuda116, and Task Manager shows CUDA is being used (24% for the 7B model).
But I'm getting generation speeds of 4-9 tokens/second. I've seen some posts on Reddit comparing speeds, and my GPU should be getting around 24 tokens/s (with the 13B model).
I don't know what to check. Maybe compiling bitsandbytes myself? (But after seeing that CUDA is being used, I feel that might not be the issue.) Edit: Also, I'm using 4-bit models.
Second this. I am on the latest update of all packages in the webui, and TheBloke_Llama-2-13B-GPTQ in ExLlama on a 4090 is only getting me 4.19-5.21 tokens/second. I see the model load into VRAM, but my GPU utilization when generating is only 8%.
So... I just reinstalled using the auto-installer and I'm constantly getting this error, even when trying to load using AutoGPT:
Exllama kernel is not installed
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.