
ExLlama API spec / discussion

Open nikshepsvn opened this issue 2 years ago • 6 comments

Opening a new thread to continue the conversation re: the API, as I think having a dedicated thread for this discussion will be valuable as the project continues to scale.

Continuation from: https://github.com/turboderp/exllama/issues/12

nikshepsvn avatar May 29 '23 16:05 nikshepsvn

I have a fork that is really just a set of drop-in scripts for exllama: https://github.com/disarmyouwitha/exllama/blob/master/fast_api.py https://github.com/disarmyouwitha/exllama/blob/master/fastapi_chat.html https://github.com/disarmyouwitha/exllama/blob/master/fastapi_request.py

fast_api.py is (currently) just a FastAPI wrapper around the model loading and generate_simple functions. It takes the -d argument for the model directory. It will load the model and start listening on port 7862 for POST requests to http://localhost:7862/generate

You can go to /chat to load the HTML through FastAPI, which lets you access the page from a browser (or from mobile, if you open the port with sudo ufw allow 7862).

fastapi_chat.html is a demo GUI written in plain HTML so it's easy to edit. I stuffed everything into one file because I didn't want to clutter up the repo. It uses tailwind.css to make everything responsive (and mobile friendly).

fastapi_request.py is an example script showing how to call the API from Python.
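
For reference, calling the endpoint from Python looks roughly like this (a minimal sketch; the payload field names here are guesses, so check fastapi_request.py for the actual format the server expects):

```python
# Hypothetical call to the fast_api.py endpoint; "prompt" and
# "max_new_tokens" are assumed field names, not a verbatim spec.
import requests

payload = {
    "prompt": "Hello, how are you?",
    "max_new_tokens": 200,
}

response = requests.post("http://localhost:7862/generate", json=payload)
response.raise_for_status()
print(response.json())
```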

I don't intend for this to be the standard or anything, just some reference code to get set up with an API (and what I have personally been using to work with exllama)

Following from our conversation in the last thread, it seems like there is lots of room to be more clever with caching, etc.

In my branch I would like to start working on handling concurrent requests, and maybe spinning up multiple models behind a load balancer à la catid/supercharger.

disarmyouwitha avatar May 29 '23 18:05 disarmyouwitha

Well, after I discovered inference on long sequences is 2-4x faster than I thought it was, maybe evaluating every prompt from the beginning isn't such a big deal after all. :)

But if you want maximum performance, it doesn't make sense to run the same tokens through the model over and over. It does get a little annoying having to keep the cache and the generator and the chat logic and the front end in sync, though, so I've settled for a compromise in the basic web UI I'm working on.

I've pushed an update to generator.py with a couple of new functions. One is reset(), which just resets the generator back to its initial state so you can reuse it that way. But more importantly there's gen_begin_reuse(), which works like gen_begin() except it checks how much of the new context is identical to its internal state and only runs inference from the first changed token onward.

So if you build up a context bit by bit, as in a chat, it will only run whatever gets added through the model (e.g. the user's input or the "Chatbot:" prefix you would run before the bot's response.) If something changes at the beginning of the context, like if you truncate the past to stay within the context length, it will start from scratch.

It seems to work quite well. And since tokenization is pretty cheap, you can just maintain the prompt as a text string and tokenize the whole thing before every generation.
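
The idea, as a simplified standalone sketch (plain Python lists stand in for the real tensors and KV cache; this is not the actual generator.py code):

```python
# Simplified illustration of the prefix-reuse idea behind gen_begin_reuse().
# The class and method names below are stand-ins, not the exllama API.
from typing import List


class PrefixReuseGenerator:
    def __init__(self) -> None:
        self.cached_tokens: List[int] = []  # tokens already run through the model

    def _run_model(self, tokens: List[int]) -> None:
        # Placeholder for the forward pass that fills the KV cache.
        print(f"running {len(tokens)} token(s) through the model")
        self.cached_tokens.extend(tokens)

    def gen_begin(self, tokens: List[int]) -> None:
        # Start from scratch: discard the cache and evaluate everything.
        self.cached_tokens = []
        self._run_model(tokens)

    def gen_begin_reuse(self, tokens: List[int]) -> None:
        # Keep the longest prefix that matches the cached state and only
        # run the remaining (new) tokens through the model.
        reuse = 0
        limit = min(len(self.cached_tokens), len(tokens))
        while reuse < limit and self.cached_tokens[reuse] == tokens[reuse]:
            reuse += 1
        if reuse == 0:
            self.gen_begin(tokens)  # nothing matches, start over
            return
        self.cached_tokens = self.cached_tokens[:reuse]  # truncate to the match
        self._run_model(tokens[reuse:])                  # only the new part


# Chat-style usage: each turn only appends tokens, so only the new
# tokens get evaluated.
gen = PrefixReuseGenerator()
gen.gen_begin_reuse([1, 2, 3, 4])        # runs 4 tokens
gen.gen_begin_reuse([1, 2, 3, 4, 5, 6])  # runs only the 2 new tokens
gen.gen_begin_reuse([9, 2, 3])           # start of context changed -> starts over
```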

turboderp avatar May 29 '23 23:05 turboderp

gen_begin_reuse() this is great^^

Replacing gen_begin() with this in generate_simple() reuses the cache in a one-on-one conversation, while still allowing it to reset if a different conversation is started~
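
A rough sketch of what a chat loop on top of that can look like (setup loosely follows the repo's example scripts; the paths, constructor arguments and parameter names here are assumptions, not a spec):

```python
# Assumed setup, loosely following exllama's example scripts; paths and
# exact constructor signatures may differ in the actual repo.
from model import ExLlama, ExLlamaConfig, ExLlamaCache
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("models/my-model/config.json")    # hypothetical path
config.model_path = "models/my-model/model.safetensors"  # hypothetical path
model = ExLlama(config)
tokenizer = ExLlamaTokenizer("models/my-model/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

context = "This is a conversation between User and Chatbot.\n"

while True:
    user_input = input("User: ")
    context += f"User: {user_input}\nChatbot:"

    # With gen_begin_reuse() inside, only the newly appended text is run
    # through the model; the shared prefix comes from the cache.
    output = generator.generate_simple(context, max_new_tokens=200)

    # Assumes generate_simple returns prompt + completion; take the new tail.
    reply = output[len(context):].strip()
    print("Chatbot:", reply)
    context = output + "\n"
```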

disarmyouwitha avatar May 30 '23 03:05 disarmyouwitha

+1 on gen_begin_reuse, cool to see that we're able to go a level above raw inference logic and make optimizations for popular use cases (chat). Nice work!

nikshepsvn avatar May 30 '23 05:05 nikshepsvn

@turboderp Thank you! gen_begin_reuse() works like a charm! And it's pretty exciting to run a 33B model with full context on a 4090 with the crazy speed of ~40 tokens/sec.

@disarmyouwitha thanks for the tips, yep, generate_simple with gen_begin_reuse is a killer combo!

epicfilemcnulty avatar May 30 '23 23:05 epicfilemcnulty

I've been sitting on it for a couple of days, but I have an implementation of the API from oobabooga's text-generation-webui in a simple script: https://gist.github.com/BlankParenthesis/4f490630b6307ec441364ab64f3ce900

Since this is basically a clone of an existing API, it is out-of-the-box compatible with SillyTavern and the like.

There are definitely some problems with it: it lacks many reasonable safeguards around things like input handling and context length, but I wanted to keep it simple. It seems to work well for the most part.
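
If it mirrors the classic text-generation-webui API, calling it from Python should look roughly like this (route, port and response shape are assumptions based on that API; check the gist for the actual values it exposes):

```python
# Assumed ooba-style endpoint: POST /api/v1/generate returning
# {"results": [{"text": ...}]}; adjust host/port/fields to match the script.
import requests

payload = {
    "prompt": "Once upon a time",
    "max_new_tokens": 200,
    "temperature": 0.7,
}

r = requests.post("http://localhost:5000/api/v1/generate", json=payload)
r.raise_for_status()
print(r.json()["results"][0]["text"])
```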

BlankParenthesis avatar Jun 08 '23 20:06 BlankParenthesis