llama.clj
Adjustments for b1321 of llama.cpp
Hey! Not sure if you're interested in this given that you beat me to it, but it might be interesting anyway. I've been using your code for a while with some adjustments that were needed to keep up to date with https://github.com/ggerganov/llama.cpp.
With the API regenerated from the headers, this gist works with https://github.com/ggerganov/llama.cpp/releases/tag/b1321
Some notable changes:
- Pre-allocate a `llama_batch`
- Change to `llama_token_to_piece` (not sure if this 100% mimics your intended behavior, but it works, even with emoji :stuck_out_tongue_closed_eyes:); see the sketch after this list
- Probably the most interesting: use batch and decode instead of `llama_eval`, which is deprecated. Note this requires keeping track of n-past, since `llama_kv_cache_token_count` is also deprecated
- Add `reset?` to `samplef`, as my grammar sampler (basically just the one from common.cpp) requires a reset in my code
- Some personal preference things while I was trying to understand the code (for example, I wasn't really sure about the `volatile!` of the candidates-buf, so I changed that)
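For reference, here's a minimal sketch of the kind of `llama_token_to_piece` call I mean, assuming the binding was regenerated from the b1321 headers. Whether the function takes a model or a context pointer (and the exact return-value convention) has shifted across llama.cpp versions, so treat this as an illustration rather than the actual llama.clj code:

```clojure
;; Hypothetical sketch: turn one token id into a UTF-8 string.
;; Assumes llama/llama_token_to_piece takes (model, token, buf, length) and
;; returns the number of bytes written, or a negative value whose magnitude
;; is the required buffer size.
(defn token->str [model token]
  (let [buf (byte-array 64)
        n (llama/llama_token_to_piece model (int token) buf (count buf))]
    (if (neg? n)
      ;; buffer was too small; retry with the required size
      (let [buf (byte-array (- n))
            n (llama/llama_token_to_piece model (int token) buf (count buf))]
        (String. buf 0 n "UTF-8"))
      (String. buf 0 n "UTF-8"))))
```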
Anyway, hopefully it's useful :-)
Note that it's probably better not to have the context hold the reference to the `llama_batch`, and instead allocate / free batches dynamically; otherwise implementing parallel decode is difficult. So it's a hack :shrug:
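The idea would be something like allocating a fresh batch per call, assuming `llama_batch_init` / `llama_batch_free` are exposed by the generated bindings (their arguments have changed between llama.cpp versions, and as the edit below notes, `llama_batch_free` was crashing the JVM for me, so this is only an illustration of the shape):

```clojure
;; Hypothetical sketch: allocate a batch per decode call instead of keeping
;; one on the context, so parallel decodes don't share mutable state.
;; llama_batch_init's argument list differs between llama.cpp versions.
(defn with-batch [n-tokens f]
  (let [batch (llama/llama_batch_init (int n-tokens) (int 0))]
    (try
      (f batch)
      (finally
        (llama/llama_batch_free batch)))))

;; usage: (with-batch 512 (fn [batch] (decode-with-batch ctx batch)))
;; where decode-with-batch is a placeholder for the actual decode step
```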
Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
These adjustments look great. The latest update makes llama.clj compatible with the latest llama.cpp, but it doesn't really take advantage of many of the improvements. I was mostly interested in making sure llama.clj could work with gguf models, but it would also be nice to start exposing some of the other new features and take advantage of the newer, more efficient APIs.
- I do want to switch from `llama_eval` to batch decoding, especially since `llama_eval` is being deprecated.
- The new API already uses `llama_token_to_piece`, but I think there's probably a way to use it more efficiently.
> Add `reset?` to `samplef`, as my grammar sampler (basically just the one from common.cpp) requires a reset in my code
Right now, `generate-tokens` is the primitive that all of the other generators are built on top of. Instead, I'd like to add a `generate-logits` function and build everything else on top of that. That would deprecate `:samplef`, which could be implemented in a backwards compatible way with something like:
```clojure
(eduction
 (if samplef
   (map samplef)
   (mirostat-v2-sampler))
 (generate-logits))
```
Samplers that require state, initialization, and completion could be implemented as transducers which already support those requirements.
For example:
```clojure
(eduction
 ;; sampler would be a transducer
 (my-sampler)
 (generate-logits ctx))
```
Adding `generate-logits` should support your `reset?` use case, since transducers have an initialization step.
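To make that concrete, here's a rough sketch of what a stateful sampler-as-transducer could look like. The names `make-grammar-sampler` and `sample-with-grammar` are hypothetical placeholders rather than real llama.clj or llama.cpp bindings; the point is just that the transducer's setup (when it's applied to the reducing function) and its completion arity give natural places for the reset and cleanup that `:samplef` plus `reset?` was covering:

```clojure
;; Hypothetical sketch of a sampler as a stateful transducer.
;; make-grammar-sampler and sample-with-grammar are placeholder names for
;; whatever the real grammar sampler provides.
(defn grammar-sampler-xf [grammar]
  (fn [rf]
    ;; this setup runs each time the eduction is reduced, so the sampler
    ;; state is created fresh per generation -- covering the reset? use case
    (let [sampler (make-grammar-sampler grammar)]
      (fn
        ([] (rf))
        ;; completion arity: a place for cleanup when generation finishes
        ([result] (rf result))
        ;; step arity: turn a logits buffer into a sampled token id
        ([result logits]
         (rf result (sample-with-grammar sampler logits)))))))

;; usage with the proposed generate-logits:
;; (eduction (grammar-sampler-xf my-grammar) (generate-logits ctx))
```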
> Some personal preference things while I was trying to understand the code (for example I wasn't really sure about the `volatile!` of the `candidates-buf` so I changed that)
I think I wasn't sure at the time whether the candidates buf size could change, but after learning more about how llama.cpp works, I think the size is static and the `candidates-buf` can just be preallocated like your code does.
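For illustration, preallocating could look something like this, assuming the candidates array holds one `llama_token_data` (an int32 id plus two floats) per vocabulary entry; whether `llama_n_vocab` takes a context or a model depends on the llama.cpp version:

```clojure
;; Hypothetical sketch: preallocate the native candidates buffer once, since
;; its size (one llama_token_data per vocab entry) doesn't change.
(import 'com.sun.jna.Memory)

(def ^:private token-data-size
  ;; int32 id + float logit + float p
  (+ 4 4 4))

(defn alloc-candidates-buf [ctx]
  (let [n-vocab (llama/llama_n_vocab ctx)]
    (Memory. (* n-vocab token-data-size))))
```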
> Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
```clojure
delete-batch (fn []
               (let [[old _new] (swap-vals! model-ptr (constantly nil))]
                 (when old
                   (llama/llama_batch_free ^llama_batch old))))
```
`delete-batch` is calling `llama_batch_free` on the model pointer instead of a batch pointer.
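A fixed version would free a reference to the batch rather than the model; here's a sketch assuming the batch is kept in a hypothetical `batch-ptr` atom:

```clojure
;; sketch of the fix: free the llama_batch, not the model pointer.
;; batch-ptr is a hypothetical atom holding the batch reference.
delete-batch (fn []
               (let [[old _new] (swap-vals! batch-ptr (constantly nil))]
                 (when old
                   (llama/llama_batch_free ^llama_batch old))))
```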
> These adjustments look great. The latest update makes llama.clj compatible with the latest llama.cpp, but it doesn't really take advantage of many of the improvements. I was mostly interested in making sure llama.clj could work with gguf models, but it would also be nice to start exposing some of the other new features and take advantage of the newer, more efficient APIs.
Glad you found it useful and/or interesting. I wasn't exactly sure how to approach it other than "here's some things I did with your code while I was trying to learn it"!
> Right now, `generate-tokens` is the primitive that all of the other generators are built on top of. Instead, I'd like to add a `generate-logits` function and build everything else on top of that. That would deprecate `:samplef`, which could be implemented in a backwards compatible way with something like: `(eduction (if samplef (map samplef) (mirostat-v2-sampler)) (generate-logits))`
>
> Samplers that require state, initialization, and completion could be implemented as transducers which already support those requirements. For example: `(eduction (my-sampler) (generate-logits ctx))` (where `my-sampler` would be a transducer)
>
> Adding `generate-logits` should support your `reset?` use case since transducers have an initialization step.
Yeah, that makes a ton of sense to me. It also solves my confusion with the double call to `get-logits` (there's also `llama_get_logits_ith` now, mainly for this: https://github.com/ggerganov/llama.cpp/blob/master/examples/batched/batched.cpp#L159).
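For context, here's a rough sketch of how `llama_get_logits_ith` gets used after a batch decode. The JNA return type is assumed to be a `com.sun.jna.Pointer`; the actual generated binding may differ:

```clojure
;; Hypothetical sketch: after llama_decode, read the logits for the token at
;; index i in the batch (typically the last token that had logits enabled).
;; Assumes the generated binding returns a com.sun.jna.Pointer.
(defn logits-at [ctx i n-vocab]
  (let [^com.sun.jna.Pointer p (llama/llama_get_logits_ith ctx (int i))]
    (.getFloatArray p 0 (int n-vocab))))
```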
> Some personal preference things while I was trying to understand the code (for example I wasn't really sure about the `volatile!` of the `candidates-buf` so I changed that)
>
> I think I wasn't sure at the time whether the candidates buf size could change, but after learning more about how llama.cpp works, I think the size is static and the `candidates-buf` can just be preallocated like your code does.
I wasn't sure either... it's the first time in a long time I've used JNA and C++, honestly. It's hard to keep up with upstream, though!
> Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
>
> `delete-batch (fn [] (let [[old _new] (swap-vals! model-ptr (constantly nil))] (when old (llama/llama_batch_free ^llama_batch old))))`
>
> `delete-batch` is calling `llama_batch_free` on the model pointer instead of a batch pointer.
Oops, thanks. I changed it anyway:
```clojure
(defn- by-reference [o v]
  (doto o (.setPointer (seq->memory v))))

(defn create-batch
  [^Memory batch-buf* num-batch-tokens n-past seq-id]
  (let [batch (doto (Structure/newInstance llama_batch) (.read))
        pos (int-array (map #(+ n-past %) (range num-batch-tokens)))
        seq-ids (int-array (repeat num-batch-tokens seq-id))
        logits (byte-array (conj (vec (repeat (dec num-batch-tokens) 0)) 1))]
    (doto batch
      (.writeField "n_tokens" (int num-batch-tokens))
      (.writeField "token" (doto (IntByReference.) (.setPointer batch-buf*)))
      (.writeField "pos" (by-reference (IntByReference.) pos))
      (.writeField "seq_id" (by-reference (IntByReference.) seq-ids))
      (.writeField "logits" (by-reference (ByteByReference.) logits))
      (.writeField "embd" nil))
    ;; I'm gonna assume the JVM is going to garbage collect these eventually; if not, it leaks memory.
    batch))
```
```clojure
(defn llama-eval
  [ctx batch seq-id n-past]
  (llama/llama_kv_cache_seq_rm ctx seq-id n-past -1)
  (let [res (llama/llama_decode ctx batch)]
    (assert (zero? res) (format "Failed to decode batch: %s" res))
    batch))
```
```clojure
(defn decode
  "Adds `s` to the current context and updates the context's logits (see `get-logits`)."
  [ctx s n-past* seq-id]
  (let [[total-tokens ^Memory tokens]
        (cond
          (string? s)
          (tokenize ctx s (zero? @n-past*))

          (integer? s)
          [1 [s]])
        ^Memory token-buf* (seq->memory tokens)]
    (assert (< @n-past* (:n-ctx ctx)) "Context size exceeded")
    (assert (< total-tokens (:n-ctx ctx)) "Input tokens exceeded context size")
    (let [batch-size (:n-batch ctx)]
      (loop [offset (long 0)
             n-past @n-past*]
        (let [batch-buf* (.share token-buf* (* offset Integer/BYTES))
              num-batch-tokens (min batch-size (- total-tokens offset))
              batch (create-batch batch-buf* num-batch-tokens n-past seq-id)
              next-offset (+ offset num-batch-tokens)]
          (llama-eval ctx batch seq-id n-past)
          (when (< next-offset total-tokens)
            (recur (long next-offset) (+ n-past num-batch-tokens))))))
    (vreset! n-past* (+ @n-past* total-tokens))
    ctx))
```
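For completeness, a quick usage sketch of the decode fn above; `create-context`, its options, and the model path stand in for however the context is actually constructed:

```clojure
;; Hypothetical usage of decode. create-context, its options, and the model
;; path are placeholders; seq-id 0 is the single-sequence case.
(def ctx (create-context "models/llama-2-7b.Q4_0.gguf"
                         {:n-ctx 2048 :n-batch 512}))

(let [n-past* (volatile! 0)]
  (decode ctx "Hello, llama!" n-past* 0)
  ;; logits for the last prompt token are now available via get-logits
  @n-past*)
```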
Thank you for all the amazing work, by the way. It really feels quite liberating to have an LLM this close to Clojure.
Hello, any updates on this? It would be nice to have llama.clj working with the latest llama.cpp.
@papadako llama.clj has already been updated to a later version of llama.cpp since this issue was created.
I just pushed a branch that is updated to work with the latest llama.cpp, but it doesn't expose all the latest features from llama.cpp in llama.clj's high level API.
If you have a use case that requires functionality from the latest llama.cpp, can you create a separate issue with details about the use case?
Thank you @phronmophobic
I am using the add-bert branch and my simple examples seem to work as expected.
Thanks!
@papadako, the latest release finally uses the new `llama_decode` API instead of `llama_eval`. I think I've also addressed all the other points except for adding a `reset?` flag to the sampler. As I mentioned, I think the `reset?` flag would be better implemented by deprecating `:samplef` and adding a generator for logits, so that samplers could just be transducers and be composed with the other parts of the pipeline.
I'm going to close this issue for now, but feel free to open a new issue with any questions or features I've missed.