
recent update kills API performance & brings all the old API bugs back

DeSinc opened this issue on Apr 25, 2023 · 0 comments

Describe the bug

Before the recent update to the API, on the kobold API (--extensions api) I was getting full, normal paragraph replies in a little over 4-5 seconds. Now it starts lagging once it gets past 3 lines and can take up to a minute to complete.

And the replies are not normal: the bot starts spamming hashtags after a short while, a bug that was present in the old default API that used to run on port 7860. That bug was just about the main reason I switched to --extensions api in the first place, and now it has come to plague the kobold API as well after this update.

As for the poor performance, it makes the same sound as running out of VRAM and swapping memory: BZZZZZZzzzzzzz.z.z.z.z.z. z. z. z.. z. z. z. z. z. z. z.

Before the update, I was loading chat histories literally 5,000 characters long into the model and getting full paragraph-long responses out in 6 seconds tops for non-weird text. It never slowed down before.

Now, after the update, something is broken such that the exact same 7B model hits this dead zone just 3 sentences in for some reason.

It's not the VRAM

I'm not running out of VRAM; it's not a VRAM issue; it's the same setup as before. Here is my VRAM usage, and as you can see, I am not running out.

(screenshot: VRAM usage, not maxed out)

Models:

The issue happens with Alpaca 7B Native, and also with Vicuna 1.1 7B; same problem there. It did not happen with the old --extensions api code, which has apparently been pasted over with the old broken default API for whatever reason.

Hashtag psychosis has also returned

Another bug that used to happen with the old, broken, default API has now returned after the update. Hashtag psychosis is where the bot, out of nowhere, starts spamming hashtags and never shuts up. It's really annoying, and it was exactly half the reason I switched to the --extensions api, which actually functioned correctly.

(screenshot: bot reply degenerating into a wall of hashtags)

It's not the JSON that I'm sending

Inspecting the JSON, there are no erroneous spaces or hashtags in it. Moreover, the proof is that it worked completely fine on the old code for weeks straight, with neither of these issues occurring even once, including hashtag psychosis. This was a default API problem, not a kobold API problem. Something has brought all of the old default API's issues over to the extension API, including the erroneously added space at the end of the prompt that causes emoji psychosis. (That was a default API bug too, and it is now also suspiciously present in the --extensions api where it was not previously.)
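To be concrete, this is roughly the shape of what goes over the wire (a minimal sketch, not my exact payload: field names follow the KoboldAI-style /api/v1/generate endpoint, the port assumes the extension's default blocking API on 5000, and the real chat history is trimmed to a stub):

```python
import json
import requests

# Roughly the shape of request being sent (real chat history trimmed).
payload = {
    "prompt": "You are a helpful assistant.\nUser: hello\nAssistant:",
    "max_length": 200,
    "temperature": 0.7,
}

resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
# Note the prompt has no trailing space and no hashtags anywhere.
```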

Args:

python server.py --model alpaca-native-4bit --wbits 4 --groupsize 128 --extensions api --notebook --xformers

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Run a 7B model, any is fine, preferably one whose performance you already know, and make it generate a 200-token response (a timing sketch follows below). Use your same params as always, or comment out all the params and just send a prompt by itself; either way it does the same thing. Before, this wouldn't take too long. Now, it slows to a crawl by around line 3 or 4 and starts sounding like it's swapping (if you listen to your card's coil whine as an indicator).
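Here's a minimal timing harness for the repro (assumptions on my part: the blocking API on its default port 5000, and the standard results[0].text response shape):

```python
import time
import requests

# Time a single 200-token generation against the --extensions api endpoint.
payload = {
    "prompt": "Write a short story about a lighthouse keeper.",
    "max_length": 200,
}

start = time.time()
resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
elapsed = time.time() - start

print(resp.json()["results"][0]["text"])
# 200 tokens is an upper bound; the exact count is in the server console.
print(f"{elapsed:.2f} s elapsed, worst-case {200 / elapsed:.1f} tokens/s")
```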

Logs

No errors, just normal operation, but slow, and with hashtags coming out of the LLM where I didn't type a single one.

127.0.0.1 - - [26/Apr/2023 01:45:42] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 2.21 seconds (15.86 tokens/s, 35 tokens, context 205, seed 1658051178)
...
127.0.0.1 - - [26/Apr/2023 01:53:29] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 2.66 seconds (13.91 tokens/s, 37 tokens, context 131, seed 77482115)
127.0.0.1 - - [26/Apr/2023 01:53:52] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 40.22 seconds (4.97 tokens/s, 200 tokens, context 125, seed 560901880)

Notice the last output is insanely slow: under 5 tokens per second. Before the update I could generate a 200-token response in under 12 seconds; now it's taking over triple that.
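For scale, from the logs above: at the earlier ~15.9 tokens/s, 200 tokens works out to about 200 / 15.86 ≈ 12.6 seconds, matching what I used to get; at 4.97 tokens/s the same 200 tokens is 200 / 4.97 ≈ 40 seconds, which is exactly the 40.22 s in the last log line.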

The output was not random text. I am aware that some text can take longer to generate, but it's not that; these were just normal words that would generate quite fast before. It does it every time now. I've had this running for weeks on end and it's never been this slow.

System Info

RTX 3080 Ti 12GB

DeSinc · Apr 25 '23 15:04