Results: 208 comments of setzer22

@danShumway I'm sorry, I left exwm a while ago. It's an awesome piece of software, but I needed something more stable for my day-to-day work :sweat_smile:. I can tell, though,...

I am also seeing very similar behavior on Linux. It makes sense because:

- Each time the game runs, the DLL is read from disk, so if there are...

@im-not-tom I got this working [on my project](https://github.com/setzer22/llama-rs/pull/14) and those are basically the steps I followed :+1: I have verified this works by saving memory, restoring on a different process...

This could be a discrepancy in size due to integer promotion rules / a potential overflow, since the sizes for 30B are gonna be larger. More liberal use of usize...
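To illustrate the kind of overflow being suggested here: with 30B-class tensor dimensions, a size computed in 32-bit arithmetic can exceed `i32::MAX` even though each factor is small. The dimensions below are illustrative stand-ins, not exact llama-rs values; a minimal sketch:

```rust
fn main() {
    // Hypothetical 30B-class dimensions (illustrative only).
    let n_layer: i32 = 60;
    let n_ctx: i32 = 2048;
    let n_embd: i32 = 6656;
    let elem_size: i32 = 4; // bytes per f32

    // The same product in 32-bit arithmetic overflows: checked_mul
    // returns None once the running product exceeds i32::MAX.
    let in_i32 = n_layer
        .checked_mul(n_ctx)
        .and_then(|v| v.checked_mul(n_embd))
        .and_then(|v| v.checked_mul(elem_size));
    assert!(in_i32.is_none());

    // In usize (64-bit on typical targets) the buffer size fits fine:
    // 60 * 2048 * 6656 * 4 = 3_271_557_120 bytes (~3.05 GiB).
    let bytes = n_layer as usize * n_ctx as usize * n_embd as usize * elem_size as usize;
    assert_eq!(bytes, 3_271_557_120);

    println!("ok: {} bytes", bytes);
}
```

Computing every size in `usize` from the start avoids the intermediate promotion question entirely, which is the spirit of the fix proposed above.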

Pushed! Sorry about that :sweat_smile: It was just a typo on my end.

I was just able to load 30B with the changes on main, but I'll wait for others to confirm before closing the issue.

@RCasatta you mean the f16 version? Yes, I wasn't able to load that one on my machine (32GB). But I'm able to load the quantized one just fine. Anyway, closing...

Oh, before I forget: I also tried using the [snap](https://docs.rs/snap/latest/snap/) crate for compression. Some quick results:

- Compression ratios for the prompt I'm sharing above look good. A cached prompt...

> Looks good to me. Would this also enable continuing an existing generation that ran out before its end-of-text?

Not directly, but it would be a step in the right...

Alright, this was quite a major cleanup session! :smile: As discussed, I've broken down the old `infer_with_prompt` function into two functions: `feed_prompt` and `infer_next_token`. This helps untangle the mess the original...
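The shape of that split can be sketched with a toy session type. The names follow the comment above, but the struct and the "model" inside are hypothetical stand-ins, not llama-rs internals:

```rust
/// Toy stand-in for an inference session; the real one holds model state.
struct InferenceSession {
    tokens: Vec<u32>,
}

impl InferenceSession {
    fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    /// Ingest the whole prompt at once, updating state without sampling.
    fn feed_prompt(&mut self, prompt: &[u32]) {
        self.tokens.extend_from_slice(prompt);
    }

    /// Sample exactly one token from the current state.
    fn infer_next_token(&mut self) -> u32 {
        // Toy "model": next token is last token + 1.
        let next = self.tokens.last().map_or(0, |t| t + 1);
        self.tokens.push(next);
        next
    }
}

fn main() {
    let mut session = InferenceSession::new();
    session.feed_prompt(&[1, 2, 3]);
    let next = session.infer_next_token();
    assert_eq!(next, 4);
    println!("next token: {}", next);
}
```

Separating the two calls lets callers feed a (possibly cached) prompt once and then drive generation token by token, which is what makes things like saving/restoring memory composable.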