
feature: unbuffered token stream

Open mudler opened this issue 2 years ago • 5 comments

Now this should be quite easy, at least for the llama.cpp backend: https://github.com/go-skynet/go-llama.cpp/pull/28. Thanks to @noxer's contribution ( :heart: ), it's now just a matter of wiring things up in the SSE callback here in the server; see the sketch after the checklist below.

  • [x] go-llama.cpp
  • [ ] gpt4all.cpp
  • [ ] gpt2.cpp
  • [x] rwkv.cpp
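
For illustration, here is a minimal sketch of the idea, not the actual LocalAI handler: the hypothetical predictWithTokenCallback stands in for the backend prediction call (e.g. the token callback added in the go-llama.cpp PR above), and each generated token is flushed to the client as its own SSE event instead of being buffered into a single response.

package main

import (
    "fmt"
    "net/http"
)

// predictWithTokenCallback is a hypothetical stand-in for the backend
// prediction call; it invokes cb once per generated token and stops if
// cb returns false.
func predictWithTokenCallback(prompt string, cb func(token string) bool) {
    for _, t := range []string{"Hello", ",", " world", "!"} {
        if !cb(t) {
            return
        }
    }
}

// streamCompletion forwards each token to the client as a server-sent
// event as soon as it is produced, instead of buffering the whole answer.
func streamCompletion(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    predictWithTokenCallback("prompt", func(token string) bool {
        fmt.Fprintf(w, "data: %s\n\n", token) // one SSE event per token
        flusher.Flush()                       // push it to the client immediately
        return true
    })

    fmt.Fprint(w, "data: [DONE]\n\n")
    flusher.Flush()
}

func main() {
    http.HandleFunc("/v1/completions", streamCompletion)
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}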

mudler avatar Apr 27 '23 16:04 mudler

The go-gpt4all-j backend also supports unbuffered token streams, so we should be (almost) all good, as 2 out of 3 backends support it so far.

mudler avatar May 01 '23 18:05 mudler

Implemented for the llama.cpp backend!

mudler avatar May 02 '23 20:05 mudler

This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

apiad avatar May 03 '23 11:05 apiad

> This is the most exciting feature for my use case! I'm wondering, have you already planned how the API will support this? Thanks!

LocalAI follows the OpenAI specs, so tokens are pushed via SSE (server-sent events) streams. This already works for llama.cpp models such as Vicuna, Alpaca, WizardLM, and the like.
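
As a concrete example, a client can consume the stream the same way it would with the OpenAI API. Below is a minimal sketch using only the Go standard library; the port and the model name are placeholders for whatever is configured locally.

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // OpenAI-compatible chat completion request with streaming enabled.
    // The model name is a placeholder for a model installed locally.
    body := []byte(`{"model": "ggml-gpt4all-j", "messages": [{"role": "user", "content": "Hello"}], "stream": true}`)

    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Tokens arrive incrementally as SSE "data: {...}" lines rather than
    // as one buffered JSON response.
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.HasPrefix(line, "data: ") {
            continue
        }
        payload := strings.TrimPrefix(line, "data: ")
        if payload == "[DONE]" {
            break
        }
        fmt.Println(payload) // each chunk carries the next token(s) in a JSON delta
    }
}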

mudler avatar May 06 '23 08:05 mudler

I am getting memory issues with LocalAI on 16 GiB of RAM, even with the smaller chat models.

This only happens with "stream": true. I guess this is related to buffering being enabled for the request?

If I make a single POST request without "stream: true", the request works properly and I receive a proper answer.

I am using GPT4All models.

EchedelleLR avatar Jun 10 '23 13:06 EchedelleLR

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smile_cat:, and I can generally provide good tips or point out places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me; I don't support replying (yet)!

:warning::warning::warning::warning::warning:

To fix the issue, we need to update the llamaOpts variable with the correct options based on the given model options. Here's an updated version of the code that should work for all three backends:

import (
    "github.com/go-skynet/LocalAI/pkg/grpc/base"
    pb "github.com/go-skynet/LocalAI/pkg/grpc/proto"
    llama "github.com/go-skynet/go-llama.cpp"
)
type LLM struct {
    base.Base
    llama *llama.LLama
}
func (llm *LLM) Load(opts *pb.ModelOptions) error {
    // Translate the gRPC model options into go-llama.cpp model options.
    llamaOpts := []llama.ModelOption{}
    if opts.ContextSize != 0 {
        llamaOpts = append(llamaOpts, llama.SetContext(int(opts.ContextSize)))
    }
    if opts.F16Memory {
        llamaOpts = append(llamaOpts, llama.EnableF16Memory)
    }
    if opts.Embeddings {
        llamaOpts = append(llamaOpts, llama.EnableEmbeddings)
    }
    if opts.NGPULayers != 0 {
        llamaOpts = append(llamaOpts, llama.SetGPULayers(int(opts.NGPULayers)))
    }
    // Add backend-specific options depending on the detected acceleration backend.
    switch backend := llama.GetBackend(); backend {
    case llama.BackendCUDA:
        if llama.IsNVIDIA() {
            llamaOpts = append(llamaOpts, llama.EnableTensorRT)
        }
    case llama.BackendOpenCL:
        llamaOpts = append(llamaOpts, llama.SetPlatform("opencl"))
    case llama.BackendVulkan:
        llamaOpts = append(llamaOpts, llama.SetPlatform("vulkan"))
    }
    // Apply the collected options to the model.
    llm.llama.SetOptions(llamaOpts...)
    return nil
}

This code checks the backend and adds any necessary options based on the given model options. It also uses the GetBackend() method to determine the current backend being used, which is needed for some of the llama options like EnableTensorRT.

Sources:

  • https://github.com/go-skynet/LocalAI/tree/master/Makefile
  • https://github.com/go-skynet/LocalAI/tree/master/go.sum
  • https://github.com/go-skynet/LocalAI/tree/master/pkg/grpc/llm/llama/llama.go
  • https://github.com/go-skynet/LocalAI/tree/master/pkg/grpc/llm/llama-grammar/llama.go

localai-bot avatar Jul 19 '23 23:07 localai-bot