llama.clj
Adjustments for b1321 of llama.cpp
Hey! Not sure if you're interested in this given that you beat me to it, but it might be interesting anyway. I've been using your code for a while with some adjustments that were needed to keep up to date with https://github.com/ggerganov/llama.cpp.
With the API regenerated from the headers, this gist works with https://github.com/ggerganov/llama.cpp/releases/tag/b1321
Some notable changes:
- Pre-allocate a `llama_batch`
- Change to `llama_token_to_piece` (not sure if this 100% mimics your intended behavior, but it works, even with emoji :stuck_out_tongue_closed_eyes:); see the sketch after this list
- Probably the most interesting: use batch and decode instead of `llama_eval`, which is deprecated. Note this requires keeping track of n-past, since `llama_kv_cache_token_count` is also deprecated
- Add `reset?` to `samplef`, as my grammar sampler (basically just the one from common.cpp) requires a reset in my code
- Some personal preference things while I was trying to understand the code (for example, I wasn't really sure about the `volatile!` of the candidates-buf, so I changed that)
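For reference, here's a minimal sketch of the kind of `llama_token_to_piece` call I mean, assuming the binding was regenerated from the b1321 headers. Whether the function takes a model or a context pointer (and the exact return-value convention) has shifted across llama.cpp versions, so treat this as an illustration rather than the actual llama.clj code:

```clojure
;; Hypothetical sketch: turn one token id into a UTF-8 string.
;; Assumes llama/llama_token_to_piece takes (model, token, buf, length) and
;; returns the number of bytes written, or a negative value whose magnitude
;; is the required buffer size.
(defn token->str [model token]
  (let [buf (byte-array 64)
        n (llama/llama_token_to_piece model (int token) buf (count buf))]
    (if (neg? n)
      ;; buffer was too small; retry with the required size
      (let [buf (byte-array (- n))
            n (llama/llama_token_to_piece model (int token) buf (count buf))]
        (String. buf 0 n "UTF-8"))
      (String. buf 0 n "UTF-8"))))
```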
Anyway, hopefully it's useful :-)
Note that it's probably better not to have the context hold the reference to the `llama_batch`, and instead allocate / free batches dynamically; otherwise implementing parallel decode is difficult. So it's a hack :shrug:
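The idea would be something like allocating a fresh batch per call, assuming `llama_batch_init` / `llama_batch_free` are exposed by the generated bindings (their arguments have changed between llama.cpp versions, and as the edit below notes, `llama_batch_free` was crashing the JVM for me, so this is only an illustration of the shape):

```clojure
;; Hypothetical sketch: allocate a batch per decode call instead of keeping
;; one on the context, so parallel decodes don't share mutable state.
;; llama_batch_init's argument list differs between llama.cpp versions.
(defn with-batch [n-tokens f]
  (let [batch (llama/llama_batch_init (int n-tokens) (int 0))]
    (try
      (f batch)
      (finally
        (llama/llama_batch_free batch)))))

;; usage: (with-batch 512 (fn [batch] (decode-with-batch ctx batch)))
;; where decode-with-batch is a placeholder for the actual decode step
```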
Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
These adjustments look great. The latest update makes llama.clj compatible with the latest llama.cpp, but it doesn't really take advantage of many of the improvements. I was mostly interested in making sure llama.clj could work with gguf models, but it would also be nice to start exposing some of the other new features and take advantage of the newer, more efficient APIs.
- I do want to switch from `llama_eval` to batch decoding, especially since `llama_eval` is being deprecated.
- The new API already uses `llama_token_to_piece`, but I think there's probably a way to use it more efficiently.
> Add `reset?` to `samplef`, as my grammar sampler (basically just the one from common.cpp) requires a reset in my code
Right now, `generate-tokens` is the primitive that all of the other generators are built on top of. Instead, I'd like to add a `generate-logits` function and build everything else on top of that. That would deprecate `:samplef`, which could be implemented in a backwards compatible way with something like:
```clojure
(eduction
 (if samplef
   (map samplef)
   (mirostat-v2-sampler))
 (generate-logits))
```
Samplers that require state, initialization, and completion could be implemented as transducers which already support those requirements.
For example:
```clojure
(eduction
 ;; sampler would be a transducer
 (my-sampler)
 (generate-logits ctx))
```
Adding `generate-logits` should support your `reset?` use case, since transducers have an initialization step.
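To make that concrete, here's a rough sketch of what a stateful sampler-as-transducer could look like. The names `make-grammar-sampler` and `sample-with-grammar` are hypothetical placeholders rather than real llama.clj or llama.cpp bindings; the point is just that the transducer's setup (when it's applied to the reducing function) and its completion arity give natural places for the reset and cleanup that `:samplef` plus `reset?` was covering:

```clojure
;; Hypothetical sketch of a sampler as a stateful transducer.
;; make-grammar-sampler and sample-with-grammar are placeholder names for
;; whatever the real grammar sampler provides.
(defn grammar-sampler-xf [grammar]
  (fn [rf]
    ;; this setup runs each time the eduction is reduced, so the sampler
    ;; state is created fresh per generation -- covering the reset? use case
    (let [sampler (make-grammar-sampler grammar)]
      (fn
        ([] (rf))
        ;; completion arity: a place for cleanup when generation finishes
        ([result] (rf result))
        ;; step arity: turn a logits buffer into a sampled token id
        ([result logits]
         (rf result (sample-with-grammar sampler logits)))))))

;; usage with the proposed generate-logits:
;; (eduction (grammar-sampler-xf my-grammar) (generate-logits ctx))
```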
> Some personal preference things while I was trying to understand the code (for example I wasn't really sure about the `volatile!` of the `candidates-buf` so I changed that)
I think I wasn't sure at the time whether the candidates buf size could change, but after learning more about how llama.cpp works, I think the size is static and the `candidates-buf` can just be preallocated like your code does.
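For illustration, preallocating could look something like this, assuming the candidates array holds one `llama_token_data` (an int32 id plus two floats) per vocabulary entry; whether `llama_n_vocab` takes a context or a model depends on the llama.cpp version:

```clojure
;; Hypothetical sketch: preallocate the native candidates buffer once, since
;; its size (one llama_token_data per vocab entry) doesn't change.
(import 'com.sun.jna.Memory)

(def ^:private token-data-size
  ;; int32 id + float logit + float p
  (+ 4 4 4))

(defn alloc-candidates-buf [ctx]
  (let [n-vocab (llama/llama_n_vocab ctx)]
    (Memory. (* n-vocab token-data-size))))
```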
> Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
```clojure
delete-batch (fn []
               (let [[old _new] (swap-vals! model-ptr (constantly nil))]
                 (when old
                   (llama/llama_batch_free ^llama_batch old))))
```
`delete-batch` is calling `llama_batch_free` on the model pointer instead of a batch pointer.
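A fixed version would free a reference to the batch rather than the model; here's a sketch assuming the batch is kept in a hypothetical `batch-ptr` atom:

```clojure
;; sketch of the fix: free the llama_batch, not the model pointer.
;; batch-ptr is a hypothetical atom holding the batch reference.
delete-batch (fn []
               (let [[old _new] (swap-vals! batch-ptr (constantly nil))]
                 (when old
                   (llama/llama_batch_free ^llama_batch old))))
```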
> These adjustments look great. The latest update makes llama.clj compatible with the latest llama.cpp, but it doesn't really take advantage of many of the improvements. I was mostly interested in making sure llama.clj could work with gguf models, but it would also be nice to start exposing some of the other new features and take advantage of the newer, more efficient APIs.
Glad you found it useful and/or interesting. I wasn't exactly sure how to approach it other than "here's some things I did with your code while I was trying to learn it"!
> Right now, `generate-tokens` is the primitive that all of the other generators are built on top of. Instead, I'd like to add a `generate-logits` function and build everything else on top of that. That would deprecate `:samplef`, which could be implemented in a backwards compatible way with something like: `(eduction (if samplef (map samplef) (mirostat-v2-sampler)) (generate-logits))`
>
> Samplers that require state, initialization, and completion could be implemented as transducers which already support those requirements. For example: `(eduction (my-sampler) (generate-logits ctx))` (where `my-sampler` would be a transducer)
>
> Adding `generate-logits` should support your `reset?` use case since transducers have an initialization step.
Yeah, that makes a ton of sense to me. It also solves my confusion with the double call to `get-logits` (there's also `llama_get_logits_ith` now, mainly for this: https://github.com/ggerganov/llama.cpp/blob/master/examples/batched/batched.cpp#L159).
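For context, here's a rough sketch of how `llama_get_logits_ith` gets used after a batch decode. The JNA return type is assumed to be a `com.sun.jna.Pointer`; the actual generated binding may differ:

```clojure
;; Hypothetical sketch: after llama_decode, read the logits for the token at
;; index i in the batch (typically the last token that had logits enabled).
;; Assumes the generated binding returns a com.sun.jna.Pointer.
(defn logits-at [ctx i n-vocab]
  (let [^com.sun.jna.Pointer p (llama/llama_get_logits_ith ctx (int i))]
    (.getFloatArray p 0 (int n-vocab))))
```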
> Some personal preference things while I was trying to understand the code (for example I wasn't really sure about the `volatile!` of the `candidates-buf` so I changed that)
>
> I think I wasn't sure at the time whether the candidates buf size could change, but after learning more about how llama.cpp works, I think the size is static and the `candidates-buf` can just be preallocated like your code does.
I wasn't sure either... it's the first time in a long time I've used JNA and C++, honestly. It's hard to keep up with upstream, though!
> Edit: I also tried https://gist.github.com/joelkuiper/a7df2522a43b2870331577dcd116f198#file-llama-clj-L314-L369, but note that if you call `llama_batch_free` on the batch, the JVM crashes. It appears to do so often, and I'm not sure why.
>
> `delete-batch (fn [] (let [[old _new] (swap-vals! model-ptr (constantly nil))] (when old (llama/llama_batch_free ^llama_batch old))))`
>
> `delete-batch` is calling `llama_batch_free` on the model pointer instead of a batch pointer.
Oops, thanks. I changed it anyway:
```clojure
(defn- by-reference [o v]
  (doto o (.setPointer (seq->memory v))))

(defn create-batch
  [^Memory batch-buf* num-batch-tokens n-past seq-id]
  (let [batch (doto (Structure/newInstance llama_batch) (.read))
        pos (int-array (map #(+ n-past %) (range num-batch-tokens)))
        seq-ids (int-array (repeat num-batch-tokens seq-id))
        logits (byte-array (conj (vec (repeat (dec num-batch-tokens) 0)) 1))]
    (doto batch
      (.writeField "n_tokens" (int num-batch-tokens))
      (.writeField "token" (doto (IntByReference.) (.setPointer batch-buf*)))
      (.writeField "pos" (by-reference (IntByReference.) pos))
      (.writeField "seq_id" (by-reference (IntByReference.) seq-ids))
      (.writeField "logits" (by-reference (ByteByReference.) logits))
      (.writeField "embd" nil))
    ;; I'm gonna assume the JVM is going to garbage collect these eventually; if not, it leaks memory.
    batch))
```
```clojure
(defn llama-eval
  [ctx batch seq-id n-past]
  (llama/llama_kv_cache_seq_rm ctx seq-id n-past -1)
  (let [res (llama/llama_decode ctx batch)]
    (assert (zero? res) (format "Failed to decode batch: %s" res))
    batch))
```
```clojure
(defn decode
  "Adds `s` to the current context and updates the context's logits (see `get-logits`)."
  [ctx s n-past* seq-id]
  (let [[total-tokens ^Memory tokens]
        (cond
          (string? s)
          (tokenize ctx s (zero? @n-past*))

          (integer? s)
          [1 [s]])
        ^Memory token-buf* (seq->memory tokens)]
    (assert (< @n-past* (:n-ctx ctx)) "Context size exceeded")
    (assert (< total-tokens (:n-ctx ctx)) "Input tokens exceeded context size")
    (let [batch-size (:n-batch ctx)]
      (loop [offset (long 0)
             n-past @n-past*]
        (let [batch-buf* (.share token-buf* (* offset Integer/BYTES))
              num-batch-tokens (min batch-size (- total-tokens offset))
              batch (create-batch batch-buf* num-batch-tokens n-past seq-id)
              next-offset (+ offset num-batch-tokens)]
          (llama-eval ctx batch seq-id n-past)
          (when (< next-offset total-tokens)
            (recur (long next-offset) (+ n-past num-batch-tokens))))))
    (vreset! n-past* (+ @n-past* total-tokens))
    ctx))
```
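For completeness, a quick usage sketch of the decode fn above; `create-context`, its options, and the model path stand in for however the context is actually constructed:

```clojure
;; Hypothetical usage of decode. create-context, its options, and the model
;; path are placeholders; seq-id 0 is the single-sequence case.
(def ctx (create-context "models/llama-2-7b.Q4_0.gguf"
                         {:n-ctx 2048 :n-batch 512}))

(let [n-past* (volatile! 0)]
  (decode ctx "Hello, llama!" n-past* 0)
  ;; logits for the last prompt token are now available via get-logits
  @n-past*)
```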
Thank you for all the amazing work, by the way. It really feels quite liberating to have an LLM this close to Clojure.
Hello, any updates on this? It would be nice to have llama.clj working with the latest llama.cpp.
@papadako llama.clj has already been updated to a later version of llama.cpp since this issue was created.
I just pushed a branch that is updated to work with the latest llama.cpp, but it doesn't expose all the latest features from llama.cpp in llama.clj's high level API.
If you have a use case that requires functionality from the latest llama.cpp, can you create a separate issue with details about the use case?
Thank you @phronmophobic
I am using the add-bert branch and my simple examples seem to work as expected.
Thanks!
@papadako, the latest release finally uses the new `llama_decode` API instead of `llama_eval`. I think I've also addressed all the other points except for adding a `reset?` flag to the sampler. As I mentioned, I think the `reset?` flag would be better implemented by deprecating `:samplef` and adding a generator for logits, so that samplers could just be transducers and be composed with the other parts of the pipeline.
I'm going to close this issue for now, but feel free to open a new issue with any questions or features I've missed.