
Keeping the model loaded in RAM


Is there a way to keep the model loaded in RAM between successive runs? I have an API-like setup, and every time a prompt comes in the model has to be loaded into RAM again, which takes a while for GPT-J. I'm using Python and basically just running the ./bin/gpt-j command via subprocess.Popen.

regstuff avatar Mar 01 '23 12:03 regstuff

To run it like a local service, you mean? Yeah, I'd love to be able to do this too...

jishnudg avatar Mar 01 '23 15:03 jishnudg

"All" you have to do is modify the for loop here: https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L649. Instead of breaking when you hit the end of text token or the max token count, wait for user input.

umhau avatar Mar 02 '23 13:03 umhau

@umhau that will just add extra input, right? I think the idea is (at least in my use case) to be able to start a new prompt without reloading the whole model.

biemster avatar Mar 02 '23 13:03 biemster

@biemster In that case, this section gives the arguments to the gptj_eval function. It looks like you should be able to reset the n_past, embd_inp, and embd_w variables to their starting values when you're done with each prompt. Then it's a matter of modifying the for loop so it clears those variables when it's done and waits for your next round of input.
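As a sketch of that reset: the struct below just groups the per-prompt variables under the names mentioned above (n_past, embd_inp, embd, and the embd_w logits buffer); it is not the actual code from main.cpp, where these are plain locals.

```cpp
#include <vector>

// Hypothetical grouping of the per-prompt state in examples/gpt-j/main.cpp.
// The model weights live elsewhere and are not touched by the reset.
struct PromptState {
    int n_past = 0;                  // number of tokens already evaluated
    std::vector<int>   embd_inp;     // tokenized input prompt
    std::vector<int>   embd;         // tokens currently being fed to gptj_eval
    std::vector<float> embd_w;       // logits returned by the last eval
};

// Call this after finishing a prompt, then go back to waiting for input.
void reset_for_next_prompt(PromptState & st) {
    st.n_past = 0;
    st.embd_inp.clear();
    st.embd.clear();
    st.embd_w.clear();
}
```

Only the context-specific buffers are cleared; the loaded weights stay resident, which is the whole point of running it as a service.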

umhau avatar Mar 03 '23 01:03 umhau

Did anyone do this? Running this as a service would be great.

mallorbc avatar Mar 30 '23 03:03 mallorbc

The mmap()/mlock() changes in llama.cpp should be applicable here.
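For context, the idea behind that approach is roughly the following (a minimal POSIX sketch, not the actual llama.cpp code; the model file name is made up): map the weights file once with mmap(), and optionally mlock() the mapping so the pages stay resident. Even across separate process runs, the mapped file can then be served from the page cache instead of being re-read from disk.

```cpp
// Minimal POSIX sketch of mmap()+mlock() on a model file (assumed name below).
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "ggml-model.bin";   // hypothetical file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; pages are loaded on demand and shared
    // between processes that map the same file.
    void * data = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Optionally pin the mapping so it is not evicted under memory pressure.
    // This can fail if RLIMIT_MEMLOCK is low; treat that as non-fatal.
    if (mlock(data, (size_t) st.st_size) != 0) {
        perror("mlock");
    }

    printf("mapped %lld bytes of %s\n", (long long) st.st_size, path);

    munlock(data, (size_t) st.st_size);
    munmap(data, (size_t) st.st_size);
    close(fd);
    return 0;
}
```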

apaz-cli avatar Apr 12 '23 00:04 apaz-cli

Inference is painfully slow on a CPU-only setup, and it seems to be because of this issue.

I'm using ctransformers, which I believe uses this library to run the models, and I found that the model is not fully loaded into memory during CPU-only inference. I then came across this line under Features in GGML's README: "Zero memory allocations during runtime".

I see the same issue with another independent library that uses llama.cpp: https://github.com/abetlen/llama-cpp-python

It seems to me that this library is probably to blame.

Questions:

  1. Does this mean GGML indeed never loads the whole model into RAM? That seems like a waste for users who have plenty of RAM available.

  2. Is the library just doing lots of reads from disk during inference?

  3. Is there an easy option to let the library load the whole model into RAM and keep it there?

This thread makes it sound like the whole model is being loaded into RAM and then unloaded after inference, but that does not seem to be the case, since I have a 3 GB model and RAM usage never goes above 1.2 GB.

Here is the Colab to test it (keep an eye on resources during inference): https://colab.research.google.com/drive/1iGifBXEaXI2JDbJG1Il7BAS8gqVWm8kR

jacmkno avatar Aug 11 '23 06:08 jacmkno