Eric Buehler

543 comments of Eric Buehler

Hi @otarkhan! We don't have GGUF support yet as I've been busy with some other models. Are you using ISQ?

Hi @otarkhan just to make sure, are you compiling with the `metal` feature? On my M3 max I'm getting 70+ T/s for that model.

@otarkhan yes, it is expected. If you increase the paged attention KV cache allocation dramatically, it'll cause macOS to use non-GPU memory in my experience, which is a dramatic slowdown....

Hi @otarkhan - the best solution is to change the wired limit (non-pageable) memory allocated to the GPU. You can do this with `sudo sysctl -w iogpu.wired_limit_mb=(value in MB)`. I...

Hi @otarkhan! Before we reopen this, let's try one thing. According to [this comment](https://x.com/awnihannun/status/1882821315264164118), running the following should help:

```
sudo sysctl iogpu.disable_wired_collector=1
```

Can you please try it?

@otarkhan I think it's finally fixed! #1506 was merged, and now I can use a large PA KV cache size on Metal. Can you please retry? This should work now.

Hi @ljt019! That is super strange. Does your computer have a GPU, and if so, are you compiling for it? Taking multiple minutes for a model is definitely odd though....

Hi @ljt019 I can't reproduce this, can you please make sure it works after:

- `cargo clean`
- `git pull`
- `cargo run --features metal -- -i run -m ...`

Hi @matthewhaynesonline! Yes, I think that implementing `ToSchema` all the way down would probably be the right way to do it. Otherwise it looks like `utoipa` can't automatically do it....

@cschin Candle has `Tensor::permute`, which should be able to do this, I think:

> I am not sure if Candle has something similar to PyTorch's movedim, so I have do...
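For reference, here is a minimal sketch of what a permute does on a row-major buffer. This is not Candle's actual implementation (Candle does it lazily via strides without copying); the naive flat-`Vec` copy version below is just an illustration of the index mapping:

```rust
// Illustrative sketch: permute the axes of a row-major tensor stored as a
// flat Vec<f32>. Returns the reordered data and the new shape.
fn permute(data: &[f32], shape: &[usize], axes: &[usize]) -> (Vec<f32>, Vec<usize>) {
    let new_shape: Vec<usize> = axes.iter().map(|&a| shape[a]).collect();

    // Row-major strides of the original shape.
    let mut strides = vec![1usize; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }

    let mut out = Vec::with_capacity(data.len());
    let mut idx = vec![0usize; new_shape.len()];
    for _ in 0..data.len() {
        // Map the multi-index in the permuted layout back to the old flat offset.
        let offset: usize = idx.iter().zip(axes).map(|(&i, &a)| i * strides[a]).sum();
        out.push(data[offset]);

        // Increment the multi-index (last axis fastest).
        for d in (0..new_shape.len()).rev() {
            idx[d] += 1;
            if idx[d] < new_shape[d] {
                break;
            }
            idx[d] = 0;
        }
    }
    (out, new_shape)
}

fn main() {
    // A 2x3 matrix [[1,2,3],[4,5,6]]; permuting axes (1,0) transposes it.
    let (t, shape) = permute(&[1., 2., 3., 4., 5., 6.], &[2, 3], &[1, 0]);
    assert_eq!(shape, vec![3, 2]);
    assert_eq!(t, vec![1., 4., 2., 5., 3., 6.]);
    println!("ok");
}
```

With Candle itself, the equivalent would just be `tensor.permute((1, 0))?`, which reorders strides rather than copying.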