LLM.swift

High latency

Open ferdinandl007 opened this issue 6 months ago • 0 comments

Describe the bug
Inference latency seems to be a lot higher when using LLM.swift than when running the same model through LM Studio: roughly 2x the latency to first token and 5x the latency per token.

To Reproduce
Minimal code that reproduces the behavior:

import SwiftUI
import LLM


class ChatBot: LLM {
    convenience init() {
        // Load the bundled Gemma 2 2B (Q8_0) GGUF and initialize it with a system prompt.
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        let systemPrompt = "you are helpful, highly intelligent assistant!"
        self.init(from: url, template: .chatML(systemPrompt))
    }
}

struct ChatView: View {
    @ObservedObject var bot: ChatBot
    @State var input = "Give me seven national flag emojis people use the most; You must include South Korea."
    init(_ bot: ChatBot) { self.bot = bot }
    func respond() { Task { await bot.respond(to: input) } }
    func stop() { bot.stop() }
    var body: some View {
        VStack(alignment: .leading) {
            ScrollView { Text(bot.output).monospaced() }
            Spacer()
            HStack {
                ZStack {
                    RoundedRectangle(cornerRadius: 8).foregroundStyle(.thinMaterial).frame(height: 40)
                    TextField("input", text: $input).padding(8)
                }
                Button(action: respond) { Image(systemName: "paperplane.fill") }
                Button(action: stop) { Image(systemName: "xmark") }
            }
        }.frame(maxWidth: .infinity).padding()
    }
}
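
For what it's worth, here is a rough way to put numbers on the gap (not part of the original report): time the existing respond(to:) call. This only uses Foundation plus the API already shown above; the whitespace-based count is just a crude proxy for real token throughput and does not separate time to first token from per-token latency.

import Foundation

// Hypothetical helper for measuring the total generation time of one response,
// to compare against LM Studio on the same prompt and model.
extension ChatView {
    func timedRespond() {
        Task {
            let start = Date()
            await bot.respond(to: input)
            let total = Date().timeIntervalSince(start)
            // Whitespace-separated pieces are only a rough stand-in for tokens.
            let approxTokens = max(bot.output.split(separator: " ").count, 1)
            print("total: \(total)s, ~\(total / Double(approxTokens))s per piece")
        }
    }
}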

Expected behavior
Since both run on llama.cpp, I would expect the latency to be the same.


Desktop (please complete the following information):

  • Chip: [e.g. Apple M1]
  • Memory: [e.g. 16GB]
  • OS: [e.g. macOS 14.0]

Additional context
I tried making the inference settings identical as well, but it did not help; latency was still significantly slower. Am I missing anything here?
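
For concreteness, a minimal sketch of what matching the settings might look like on the LLM.swift side. The parameter names (topK, topP, temp, maxTokenCount) are an assumption about the initializer's signature and should be checked against the library's current API; the values are placeholders to be copied from whatever LM Studio reports for the same model.

import SwiftUI
import LLM

// Sketch only: assumes LLM.swift's init(from:template:) exposes sampling
// parameters with these names. Verify against the actual initializer.
class TunedChatBot: LLM {
    convenience init() {
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        let systemPrompt = "you are helpful, highly intelligent assistant!"
        self.init(
            from: url,
            template: .chatML(systemPrompt),
            topK: 40,            // set to LM Studio's top-k
            topP: 0.95,          // set to LM Studio's top-p
            temp: 0.8,           // set to LM Studio's temperature
            maxTokenCount: 2048  // set to LM Studio's context length
        )
    }
}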

ferdinandl007 · Aug 21 '24 04:08