Hanshi Sun

Results: 25 comments of Hanshi Sun

Thanks! But how can I make it work? Do you have an example command?

I tried setting num_gpus to 2, but it seems to place an identical copy of the model on each GPU at the same time.

Yes, you are right! Thanks! And the performance of single-GPU inference with KV-cache offload is really nice! But I have a question: I found that this [fork of transformers](https://github.com/tjruwase/transformers/tree/kvcache-offload-cpu) actually allocates a buffer for KV...
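
For context, here is a minimal sketch of what such a pre-allocated CPU KV buffer might look like. The shapes, the `offload_layer_kv` helper, and the pinned-memory layout are illustrative assumptions on my part, not the fork's actual code:

```python
import torch

# Illustrative shapes only (assumptions, not the fork's actual configuration).
num_layers, batch, num_heads, max_seq_len, head_dim = 32, 1, 32, 4096, 128

# Pre-allocate one pinned CPU buffer for the whole KV cache up front;
# pinned memory allows asynchronous device<->host copies.
kv_cpu = torch.empty(
    (num_layers, 2, batch, num_heads, max_seq_len, head_dim),
    dtype=torch.float16,
    pin_memory=True,
)

def offload_layer_kv(layer_idx: int, key: torch.Tensor, value: torch.Tensor, pos: int) -> None:
    """Hypothetical helper: copy one decoding step's per-layer KV (on GPU)
    into the corresponding slot of the CPU buffer."""
    kv_cpu[layer_idx, 0, :, :, pos].copy_(key, non_blocking=True)
    kv_cpu[layer_idx, 1, :, :, pos].copy_(value, non_blocking=True)
```

The point of allocating the full buffer once (rather than growing it step by step) would be to avoid repeated host allocations and to keep the memory pinned for fast async transfers.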

Thank you very much! Nice work!

Hello, thanks for your interest in our work! In our provided implementation, we set $\gamma_1 = 1$ because we observed that the performance is nearly the same for $\gamma_1 =...

Hello, may I ask how much memory your device has? You can try decreasing the prefill length from 124928 to 122880 to see if it still hits OOM. The code can run...
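
As a quick sanity check on those numbers (assuming the prefill is processed in 2048-token chunks, which is my assumption here), the suggested reduction corresponds to dropping exactly one chunk:

```python
# Assumption: prefill proceeds in 2048-token chunks; both lengths are exact multiples.
assert 124928 == 61 * 2048
assert 122880 == 60 * 2048  # i.e., one 2048-token chunk fewer
```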

What is your `transformers` version? Can you set it to `transformers==4.37.2`, since the `apply_rotary_pos_emb` API changed in recent versions?
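
If it helps, a small sketch for failing fast at runtime when a different `transformers` version is installed (the pinned version string is simply the one suggested above):

```python
import transformers

# Fail fast if a newer transformers with the changed apply_rotary_pos_emb API is installed.
assert transformers.__version__ == "4.37.2", (
    f"expected transformers==4.37.2, got {transformers.__version__}"
)
```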

Yeah I am using CUDA 12.1. Here is my `flash_attn` version:

```
>>> import torch
>>> torch.__version__
'2.2.1+cu121'
>>> import flash_attn
>>> flash_attn.__version__
'2.5.7'
```

I have added a FAQ in the README :)