Justine Tunney

655 comments by Justine Tunney

Here's 453ms latency to generate a single token using a 14 gig model:

```
jart@luna:~/llamafile$ rusage o//llama.cpp/main/main -m /weights/Mistral-7B-Instruct-v0.3.BF16.gguf --cli -n 1 --log-disable --temp 0 --special
Question took 449,432µs...
```

83 seconds to load a 3.6 GB file works out to roughly 43 MB/s. Do you have a 5400 rpm disk connected over gigabit ethernet? You're going to pay that cost no matter what you do. If...

I don't know what provisioned concurrency is. But I'd assume that, with warmup removed, you would have some other system send the warmup request automatically, and then you'd block any user...

You have the opportunity to be the first person to productionize the brand-new llamafile server v2.0 that I'm working on. So far it has an `/embedding` endpoint. Embedding models...

Oh, there's also a tokenization endpoint:

```
jtunn@gothbox:~$ curl http://127.0.0.1:8080/tokenize?prompt=hello+world
{
  "add_special": true,
  "parse_special": false,
  "tokens": [
    "[CLS]",
    " hello",
    " world",
    "[SEP]"
  ]
}
```
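The `/embedding` endpoint mentioned above presumably follows the same query-string pattern; here's a minimal sketch, with the caveat that the parameter name (`content`) is an assumption rather than something shown in these comments:

```
# Hypothetical sketch of calling the /embedding endpoint; the endpoint path is
# from the comment above, but the "content" parameter name is an assumption.
curl "http://127.0.0.1:8080/embedding?content=hello+world"
```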

`o//llamafile/server/main` has to be built from source. In the future, it'll be called `llamafile --server`. But right now it's a separate binary that's independent of our releases. The current `llamafile...
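As a rough sketch of what building that binary from source might look like (the clone URL is the real llamafile repo, but the exact make target is an assumption inferred from the `o//llamafile/server/main` path above):

```
# Rough sketch, assuming the repo's usual make-based build; the target name is
# inferred from the o//llamafile/server/main path mentioned above.
git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j8 o//llamafile/server/main
o//llamafile/server/main -m /weights/your-model.gguf   # model path is a placeholder
```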

Sure I can do that. I'll just add it as a flag, as you suggested, since I think doing the warmup is good in general.

Thanks for your patience. I've added the warmup flag. Let me know if there are any issues with it. As for the warmup endpoint, could you just try that by sending...
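For reference, the "send a warmup request yourself" approach from earlier in the thread can be sketched like this; the endpoint paths come from the examples above, and the `/embedding` parameter name is still an assumption:

```
#!/bin/sh
# Sketch of warming the server by hand: start it, wait until it answers, then
# send one request so the first real user doesn't pay the cold-start cost.
# Endpoint paths are from the comments above; the "content" parameter is an
# assumption, and the model path is a placeholder.
o//llamafile/server/main -m /weights/your-model.gguf &
until curl -fsS "http://127.0.0.1:8080/tokenize?prompt=ready" >/dev/null 2>&1; do
  sleep 1
done
curl -fsS "http://127.0.0.1:8080/embedding?content=warmup" >/dev/null
echo "warmed up"
```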

Also this change just got shipped in our 0.8.12 release. Enjoy!

I don't think automating sync is realistically going to happen. Upstream made a lot of changes we can't agree to, such as CUDA code size being too large, server having...