[Tracking] WebLLM: OpenAI-Compatible APIs in ChatModule
Overview
The goal of this task is to implement APIs that are compatible with the OpenAI API. Existing APIs like `generate()` will be kept. Essentially we want JSON-in and JSON-out, resulting in usage like:
```typescript
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" },
    ],
    // optional generation configs here
  });
  console.log(completion.choices[0]);
}

main();
```
If streaming:
```typescript
const completion = await chat.chat_completion({
  messages: [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" },
  ],
  stream: true,
  // optional generation configs here
});

for await (const chunk of completion) {
  console.log(chunk.choices[0].delta.content);
}
```
Action items
- [x] O1: Implement the basic `chat_completion()` (both streaming and non-streaming), supporting configs/features that we do not currently have inside `llm_chat.ts`
  - https://github.com/mlc-ai/web-llm/pull/298
  - https://github.com/mlc-ai/web-llm/pull/300
- [ ] O2: Support function calling (`tools`); see the sketch after this list
- [ ] O3: Documentation and tests for the WebLLM repo
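For O2, a minimal sketch of what a function-calling request could look like, assuming WebLLM adopts the `tools` / `tool_choice` / `tool_calls` shapes from OpenAI's `openai-node`; the exact WebLLM surface is still undecided, and `get_weather` is a hypothetical app-provided function:

```typescript
// Sketch only: assumes the request/response mirror OpenAI's tools shapes.
const completion = await chat.chat_completion({
  messages: [{ "role": "user", "content": "What is the weather in Tokyo?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather", // hypothetical function exposed by the app
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
  tool_choice: "auto",
});
// In OpenAI's response shape, requested calls appear here:
console.log(completion.choices[0].message.tool_calls);
```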
Existing gaps
There are some fields/features that are not yet supported in WebLLM compared to OpenAI's `openai-node`.
Fields in `ChatCompletionRequest`
- `model`: in WebLLM, we need to call `reload(model)` instead of making it an argument in `ChatCompletionRequest` (see the sketch after this list)
- `response_format` (JSON formatting)
- Function calling related:
  - `tool_choice`
  - `tools`
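A sketch of the `model` difference, reusing the usage from the overview above; the `openai` client call is shown in comments only for contrast:

```typescript
const messages = [{ "role": "user", "content": "Hello!" }];

// OpenAI: the model is a per-request field.
// const completion = await openai.chat.completions.create({
//   model: "gpt-3.5-turbo",
//   messages,
// });

// WebLLM: the model is selected once via reload(), then omitted
// from each ChatCompletionRequest.
await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
const completion = await chat.chat_completion({ messages });
```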
Fields in `ChatCompletion` response
- `system_fingerprint`: not applicable in our case (OpenAI needs it because they process requests remotely on their servers)
Others
- We do not support `n > 1` when streaming, since `llm_chat.ts` does not support maintaining multiple sequences. We would have to finish one sequence and then start generating another, which conflicts with the goal of streaming in chunks (see the sketch below).
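A sketch of the resulting constraint, assuming the `n` field follows OpenAI's meaning (number of choices to generate per request):

```typescript
// Sketch: n > 1 is only meaningful without streaming, since llm_chat.ts
// generates one sequence at a time.
const completion = await chat.chat_completion({
  messages: [{ "role": "user", "content": "Hello!" }],
  n: 2, // two choices, generated one after another internally
  // stream: true together with n > 1 is not supported
});
console.log(completion.choices.map((choice) => choice.message.content));
```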
Future Items
- Support chat completion with image inputs (e.g. LLaVA), with a Gradio frontend
- Add support for low-level APIs for post-forward logit processing (see the sketch after this list)
  - Supported here: https://github.com/mlc-ai/web-llm/pull/277
- Support embedding models
- More modalities such as audio
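For the logit-processing item, a minimal sketch of what such a hook could look like; the `LogitProcessor` interface name and method shape here are assumptions for illustration only, not the final API (see the PR above for the actual implementation):

```typescript
// Hypothetical interface for illustration: a callback invoked after each
// forward pass, before sampling, so callers can bias or mask logits.
interface LogitProcessor {
  processLogits(logits: Float32Array): Float32Array;
}

// Example: forbid a specific token id by setting its logit to -Infinity.
class BanTokenProcessor implements LogitProcessor {
  constructor(private bannedTokenId: number) {}
  processLogits(logits: Float32Array): Float32Array {
    logits[this.bannedTokenId] = -Infinity;
    return logits;
  }
}
```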
@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling (O2).