Keeping the model loaded in RAM
Is there a way to keep the model loaded in RAM between successive runs? I have an API-like setup, and every time a prompt comes in, the model has to be loaded into RAM again, which takes a while for GPT-J. I'm using Python and basically just running the ./bin/gpt-j command via subprocess.Popen.
To run it like a local service, you mean? Yeah, I'd love to be able to do this too...
"All" you have to do is modify the for loop here: https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L649. Instead of breaking when you hit the end of text token or the max token count, wait for user input.
@umhau that will just add extra input, right? I think the idea is (at least in my use case) to be able to start a new prompt without reloading the whole model.
@biemster In that case, this section gives the arguments for the gptj_eval function. It looks like you should be able to reset the n_past, embd_inp, and embd_w variables to the starting values when you're done with each prompt. Then it's a matter of modifying the for loop so it clears the vars when it's done, and waits for your next round of input.
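Something along these lines, as a rough sketch only: the variable names (n_past, embd_inp, embd, logits/embd_w) are borrowed from examples/gpt-j/main.cpp, model/vocab/params are assumed to still be in scope, and run_generation() is the hypothetical helper from the previous comment.

```cpp
// Rough sketch of the per-prompt reset; not a drop-in patch.
void run_generation(const std::string & prompt) {
    int n_past = 0;                                       // fresh context each time
    std::vector<gpt_vocab::id> embd_inp = ::gpt_tokenize(vocab, prompt);
    std::vector<gpt_vocab::id> embd;
    std::vector<float> logits;                            // output of gptj_eval ("embd_w")
    size_t mem_per_token = 0;

    for (size_t i = embd.size(); i < embd_inp.size() + params.n_predict; i++) {
        // ... existing loop body from main.cpp goes here:
        //     gptj_eval(model, params.n_threads, n_past, embd, logits, mem_per_token);
        //     sample/print the next token, append it to embd, advance n_past ...
        // break on end-of-text; the caller then just waits for the next prompt.
    }
}
```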
Did anyone do this? Running this as a service would be great.
The mmap()/mlock() changes in llama.cpp should be applicable here.
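For reference, the core of that approach is roughly the following. This is a minimal POSIX sketch of the general mmap()/mlock() pattern, not the actual llama.cpp implementation, and map_model is a made-up helper name:

```cpp
// Map the model file instead of read()-ing it into malloc'd buffers, so the
// weights live in the OS page cache (shared across runs), and optionally
// mlock() them so they are never evicted.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

void * map_model(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return nullptr; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                          // the mapping stays valid after close()
    if (addr == MAP_FAILED) { perror("mmap"); return nullptr; }

    // Optional: pin the pages in RAM. Needs a large enough RLIMIT_MEMLOCK,
    // so treat failure as non-fatal.
    if (mlock(addr, st.st_size) != 0) {
        perror("mlock (continuing without it)");
    }

    *size_out = (size_t) st.st_size;
    return addr;                        // tensor data can then point into this mapping
}
```

Even without mlock(), a second run that maps the same file starts quickly as long as the pages are still in the page cache; mlock() just guarantees they stay there.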
Inference is painfully slow on a CPU-only setup, and it seems to be because of this issue.
I'm using ctransformers, which I believe uses this library to run the models, and I found that the model is not fully loaded into memory during CPU-only inference. Then I found this item under Features in GGML's README: "Zero memory allocations during runtime".
The same issue occurs with another independent library that uses llama.cpp: https://github.com/abetlen/llama-cpp-python
So it seems to me that this library is probably the common cause.
Questions:
- Does this mean GGML indeed never loads the whole model into RAM? That seems like a waste for users who have plenty of RAM.
- Is the library just doing lots of disk reads during inference?
- Is there an easy option to make the library load the whole model into RAM and keep it there?
This thread makes it sound like the whole model is loaded into RAM and then unloaded after inference, but that does not seem to be the case, since I have a 3 GB model and RAM usage never goes above 1.2 GB.
Here is the Colab to test it (keep an eye on resources during inference): https://colab.research.google.com/drive/1iGifBXEaXI2JDbJG1Il7BAS8gqVWm8kR