node-llama-cpp
feat: Apply different LoRA dynamically
Feature Description
Allow changing the LoRA adapter dynamically after loading the LLaMA model.
The Solution
See the llama_model_apply_lora_from_file() function in llama.cpp.
https://github.com/ggerganov/llama.cpp/blob/e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb/llama.h#L353C1-L359C1
Considered Alternatives
None.
Additional Context
No response
Related Features to This Feature Request
- [ ] Metal support
- [ ] CUDA support
- [ ] Grammar
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.
@snowyu Can you please provide an example of a proposed usage with llama.cpp
showing how you would like to use it?
Please provide links to files that you use and what you are generally trying to achieve.
I want to keep the API of this library relatively high-level while still offering advanced capabilities, so I wouldn't necessarily want to expose the llama_model_apply_lora_from_file
function as-is, and instead understand your use case better to try to figure out what's the best way to provide support for this.
Usually the base LLM model is more than 4GB in size, while the corresponding LoRA adapters are relatively small, typically a few hundred megabytes.
With dynamic LoRA loading, several LoRAs fine-tuned on the same base model can be switched quickly in memory, with no need to keep multiple full-size LLM models around.
// pseudocode
llama_model * model = llama_load_model_from_file("ggml-base-model-f16.bin", mparams);
...
// switch to the Animal Domain LoRA adapter
int err = llama_model_apply_lora_from_file(model,
                                           "animal-lora-adapter.bin",
                                           lora_scale,
                                           NULL, // <--- optional lora_base model
                                           params.n_threads);
// switch to the Astronomy Domain LoRA adapter
err = llama_model_apply_lora_from_file(model,
                                       "astronomy-lora-adapter.bin",
                                       lora_scale,
                                       NULL, // <--- optional lora_base model
                                       params.n_threads);
Note about lora_base: When using a quantized model, the quality may suffer. To avoid this, specify an f16/f32 model with lora_base to use as a base. The layers modified by the LoRA adapter will be applied to the lora_base model and then quantized to the same format as the base model. Layers not modified by the LoRA adapter will remain untouched.
Just curious if there has been any progress on this?
I think it would be nice to be able to specify a LoRA adapter in the LlamaModelOptions, or to be able to call a method on LlamaModel.
If that makes sense, I'd be willing to start looking into it.
@vlamanna The beta of version 3 is now mature enough, so I've added support for loading a LoRA adapter as part of loading a model (#217); set the lora option on llama.loadModel({ ... }) to use it.
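For example, a minimal sketch of what that looks like (the file paths here are placeholders, and the exact shape of the lora option may differ between beta versions, so check the documentation of the version you're using):

// minimal sketch — load a model with a LoRA adapter applied at load time
// (file paths are placeholders; verify the lora option shape against the docs)
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "ggml-base-model-f16.gguf",
    lora: {
        adapters: [{
            filePath: "animal-lora-adapter.gguf"
        }]
    }
});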
@snowyu Changing a LoRA on a model at runtime is not possible at the moment, as there's no way to unload an adapter after it has been applied to a model; every call to llama_model_apply_lora_from_file
loads another adapter onto the current model state.
This feature will be available in the next beta version that I'll release soon.
:tada: This issue has been resolved in version 3.0.0-beta.20 :tada:
The release is available on:
- npm package (@beta dist-tag)
- GitHub release: v3.0.0-beta.20
Your semantic-release bot :package::rocket:
@snowyu Changing a LoRA on a model at runtime is not possible at the moment, as there's no way to unload an adapter after it has been applied to a model; every call to llama_model_apply_lora_from_file loads another adapter onto the current model state.
@giladgd It can't be done with the low-level API, but it could be done in a high-level API, like this:
// pseudocode
class LlamaModel {
    loadLoRAs(loraFiles, scale, threads, baseModelPath?) {
        let needDeinit = false;

        // check the LoRA adapters that are already loaded
        for (let i = 0; i < this.loraModels.length; i++) {
            const loraModel = this.loraModels[i];
            const ix = loraFiles.indexOf(loraModel.file);
            if (ix === -1) {
                needDeinit = true;
                break;
            }
            loraFiles.splice(ix, 1);
        }

        // free the model if other LoRA adapters were already applied to it
        if (needDeinit) {
            // deinit and load the base model again
            this.reloadModel();
        }

        // apply the LoRA adapters that aren't loaded yet
        for (const loraFile of loraFiles) {
            const model = _loadLoRA(loraFile, scale, threads, baseModelPath);
            if (model) this.loraModels.push(model);
        }
    }
}
@snowyu It can be done with the high-level API that I've added. Providing an API to modify the currently loaded model by changing the LoRAs applied to it at runtime is not preferable, since all dependent states (such as contexts) would also have to be reloaded when switching LoRAs. Given that there's no performance benefit to doing it that way (as unloading a LoRA is not possible at the low level), exposing such an API isn't worth it; it would only make the usage of this library more complicated.
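For example, rather than mutating a loaded model, switching domains can be done by disposing the model and loading it again with a different adapter; a rough sketch (adapter file names are hypothetical, and the lora option shape may differ between versions):

// rough sketch — "switch" LoRAs by reloading the model with another adapter
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

let model = await llama.loadModel({
    modelPath: "ggml-base-model-f16.gguf",
    lora: {adapters: [{filePath: "animal-lora-adapter.gguf"}]}
});

// later, when the astronomy domain is needed:
await model.dispose(); // contexts created from this model need to be recreated as well
model = await llama.loadModel({
    modelPath: "ggml-base-model-f16.gguf",
    lora: {adapters: [{filePath: "astronomy-lora-adapter.gguf"}]}
});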
@giladgd If you only consider the API design from the perspective of performance, that's indeed the case. But from the perspective of ease of use, it is worth exploring.
Let me describe my usage scenario: a simple intelligent-agent script engine, where agents can call each other and each agent may use a different LLM, so LLM reloading is commonplace. My current pain is that I have to manage these LLMs inside the agent script engine myself:
- Determine the LLM caching strategy based on memory size
- Determine whether to switch LLMs or run multiple LLMs simultaneously (based on available VRAM and RAM and the optimal parameters)
- Manage and maintain recommended configurations for each LLM
even though all of this should really be the responsibility of the LLM engine, not the agent script engine.
@snowyu We have plans to make the memory management transparent, so you can focus on what you'd like to do with models, and node-llama-cpp
will offload and reload things back to memory as needed so you can achieve everything you'd like to do without managing any memory at all, and in the most performant way possible with the current hardware.
Over the past few months, I've laid the infrastructure for building such a mechanism, but there's still work to do to achieve this. Since it has taken much longer than I initially anticipated, this feature will be released as a non-breaking addition after the version 3 stable release (which is coming very soon).
Perhaps you've noticed, for example, that you don't have to specify gpuLayers when loading a model or contextSize when creating a context anymore, as node-llama-cpp measures the current hardware and estimates how many resources things will consume, to find the optimal balance between many parameters while maximizing the performance of each one up to the limits of the hardware.
This is part of the effort to achieve seamless memory management and default zero-config.
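For example, this now works with no manual tuning at all (the model path is a placeholder):

// no gpuLayers or contextSize specified — both are estimated automatically
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "my-model.gguf"});
const context = await model.createContext();
console.log(context.contextSize); // the context size that was chosen for this hardware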
Allowing a model's state to be modified at runtime at the library level would make using this library more complicated (due to all the hassle it incurs to keep things working and the performance tradeoffs it entails), and I think it's a lacking solution to the memory-management hassle that I'm working on solving at its root.
@giladgd
We have plans to make the memory management transparent, so you can focus on what you'd like to do with models, and node-llama-cpp will offload and reload things back to memory as needed so you can achieve everything you'd like to do without managing any memory at all, and in the most performant way possible with the current hardware.
It's great; looking forward to it.
Perhaps you've noticed, for example, that you don't have to specify gpuLayers when loading a model and contextSize when creating a context anymore.
Yes, I have. Have you thought about adding a memory estimate for the mmprojector model as well?
Allowing to modify a model state at runtime on the library level will make using this library more complicated (due to all of the hassle it incurs to keep things working or the performance tradeoffs it embodies), and I think is a lacking solution to the memory management hassle that I work on solving from its root.
Totally agree.
@snowyu
Yes, I have. Have you thought about adding a memory estimate for the mmprojector model as well?
I don't know what model you are referring to.
I reverse-engineered llama.cpp to figure out how to estimate resource requirements using only the metadata of a model file, without actually loading it; it isn't perfect, but the estimation is pretty close to the actual usage for the many models I've tested it on.
To find out how accurate the estimation is for a given model, you can run this command:
npx node-llama-cpp@beta inspect measure <model path>
If you notice that the estimation is way off for some model and want to fix it, you can look at the llama.cpp codebase to figure out the differences in how memory is allocated for that model, and open a PR on node-llama-cpp to update the estimation algorithms in GgufInsights.
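If you want to look at that metadata yourself, here's a small sketch (this assumes the readGgufFileInfo helper exported by node-llama-cpp; the model path is a placeholder):

// sketch — read a model's GGUF metadata without loading the model into memory
import {readGgufFileInfo} from "node-llama-cpp";

const fileInfo = await readGgufFileInfo("path/to/model.gguf");
console.log(fileInfo.metadata); // the parsed metadata the estimation is based on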
@giladgd Sorry, I've been busy with my project lately. The mmprojector comes from multimodal LLMs; maybe you haven't used the llava part of llama.cpp yet.
You may be interested in the Programmable Prompt Engine project I'm working on.
I hope to add node-llama-cpp as the default provider in the near future, but for now I don't see a good API entry point to start from. I need a simple API:
// come from https://github.com/isdk/ai-tool.js/blob/main/src/utils/chat.ts
export const AITextGenerationFinishReasons = [
'stop', // model generated stop sequence
'length', // model generated maximum number of tokens
'content-filter', // content filter violation stopped the model
'tool-calls', // model triggered tool calls
'abort', // aborted by user or timeout for stream
'error', // model stopped because of an error
'other', null, // model stopped for other reasons
] as const
export type AITextGenerationFinishReason = typeof AITextGenerationFinishReasons[number]
export interface AIResult<TValue = any, TOptions = any> {
/**
* The generated value.
*/
content?: TValue;
/**
* The reason why the generation stopped.
*/
finishReason?: AITextGenerationFinishReason;
options?: TOptions
/**
* for stream mode
*/
stop?: boolean
taskId?: AsyncTaskId; // for stream chunk
}
// https://github.com/isdk/ai-tool-llm.js/blob/main/src/llm-settings.ts
export enum AIModelType {
chat, // text to text
vision, // image to text
stt, // audio to text
drawing, // text to image
tts, // text to audio
embedding,
infill,
}
// fake API
class AIModel {
llamaLoadModelOptions: LlamaLoadModelOptions
supports: AIModelType|AIModelType[]
options: LlamaModelOptions // default options
static async loadModel(filename: string, options?: {aborter?: AbortController, onLoadProgress, ...} & LlamaLoadModelOptions): Promise<AIModel>;
async completion(prompt: string, options?: {stream?: boolean, aborter?: AbortController,...} & LlamaModelOptions): Promise<AIResult|ReadStream<AIResult>>
fillInMiddle...
tokenize...
detokenize...
}
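For reference, here is roughly what a completion call like the one above would wrap when backed by the current node-llama-cpp API (a sketch only; the model path and prompt are placeholders, not a proposed implementation):

// sketch — roughly what AIModel.completion() would delegate to
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "my-model.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

const content = await session.prompt("Hello!");
const result = {content, finishReason: "stop"}; // shaped like the AIResult interface above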
:tada: This PR is included in version 3.0.0 :tada:
The release is available on:
Your semantic-release bot :package::rocket: