LLM.swift
High latency
Describe the bug
Inference latency is significantly higher when using LLM.swift than when running the same model through LM Studio: roughly 2x the time to first token and about 5x the time per token.

To Reproduce
Minimal code that reproduces the behavior:
import SwiftUI
import LLM

class ChatBot: LLM {
    convenience init() {
        // Load the bundled GGUF model and apply a ChatML template.
        let url = Bundle.main.url(forResource: "gemma-2-2b-it-Q8_0", withExtension: "gguf")!
        let systemPrompt = "you are helpful, highly intelligent assistant!"
        self.init(from: url, template: .chatML(systemPrompt))
    }
}

struct ChatView: View {
    @ObservedObject var bot: ChatBot
    @State var input = "Give me seven national flag emojis people use the most; You must include South Korea."
    init(_ bot: ChatBot) { self.bot = bot }
    func respond() { Task { await bot.respond(to: input) } }
    func stop() { bot.stop() }
    var body: some View {
        VStack(alignment: .leading) {
            // Streams the model's output as it is generated.
            ScrollView { Text(bot.output).monospaced() }
            Spacer()
            HStack {
                ZStack {
                    RoundedRectangle(cornerRadius: 8).foregroundStyle(.thinMaterial).frame(height: 40)
                    TextField("input", text: $input).padding(8)
                }
                Button(action: respond) { Image(systemName: "paperplane.fill") }
                Button(action: stop) { Image(systemName: "xmark") }
            }
        }.frame(maxWidth: .infinity).padding()
    }
}
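
For concreteness, the 2x / 5x figures come from rough timings like the sketch below. It assumes `output` is the @Published property driving the view above; `measure` is a hypothetical helper written for this report, not part of LLM.swift. Time to first token is approximated by the first non-empty change to `output`, and generation speed is estimated from the final output length:

import Foundation
import Combine
import LLM

// Rough latency measurement (a sketch; assumes `output` is @Published).
func measure(_ bot: ChatBot, prompt: String) async {
    let start = Date()
    var firstToken: TimeInterval?
    let observer = bot.$output
        .dropFirst()                 // skip the value emitted on subscription
        .filter { !$0.isEmpty }      // ignore the reset at the start of a response
        .sink { _ in
            if firstToken == nil { firstToken = Date().timeIntervalSince(start) }
        }
    await bot.respond(to: prompt)
    observer.cancel()
    let total = Date().timeIntervalSince(start)
    print("time to first token: \(firstToken ?? total)s")
    print("total: \(total)s for \(bot.output.count) characters")
}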
Expected behavior
Since both run on llama.cpp, I would expect the latency to be roughly the same.
Desktop (please complete the following information):
- Chip: [e.g. Apple M1]
- Memory: [e.g. 16GB]
- OS: [e.g. macOS 14.0]
Additional context
I also tried making the inference settings identical, but it did not help; latency was still significantly slower. Am I missing anything here?
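
For reference, "identical inference settings" means matching the sampler and context configuration at initialization, roughly as in the sketch below. The parameter names (seed, topK, topP, temp, maxTokenCount) are assumptions modeled on common llama.cpp bindings and may not match LLM.swift's actual initializer; check the library's signature before copying this:

import Foundation
import LLM

extension ChatBot {
    // Sketch of matching LM Studio's sampler settings.
    // NOTE: all parameter labels below are assumptions, not confirmed API.
    convenience init(matchingLMStudioSettings url: URL) {
        self.init(
            from: url,
            template: .chatML("you are helpful, highly intelligent assistant!"),
            seed: 42,            // fixed seed so runs are comparable
            topK: 40,
            topP: 0.95,
            temp: 0.8,
            maxTokenCount: 2048  // context window matched to LM Studio
        )
    }
}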