Lvjinhong
> Hmm, what GPU do you use? Normally most people will add --medvram or --xformers (for some GPUs) to allow it to run on 6 or even 4 GB of VRAM....
With --medvram it will look like this. I think I can try the previous version.
> @beginlner thanks for the info. Reading https://github.com/microsoft/DeepSpeed-Kernels/blob/main/dskernels/inf_flash_attn/blocked_flash/flash_fwd_kernel.h as well. So far, is there any progress on enabling speculative decoding for vLLM? Additionally, I'm wondering if the implementation of this...
When can this branch be merged? In the version I am currently using, there is:
```python
op=xops.fmha.MemoryEfficientAttentionFlashAttentionOp[0] if (is_hip()) else None,
```
Is the Flash operation supported only for HIP?
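For context, here is a minimal sketch of how that `op` argument flows into xformers' attention entry point. The `is_hip()` helper, the tensor shapes, and the forward-only call below are my own assumptions for illustration, not code taken from vLLM:

```python
# Sketch only: op=None lets xformers auto-dispatch to the best available kernel,
# while indexing [0] on MemoryEfficientAttentionFlashAttentionOp forces the
# flash-attention forward op (the tuple is (forward op, backward op)).
import torch
import xformers.ops as xops

def is_hip() -> bool:
    # Hypothetical helper mirroring the check in the snippet above.
    return torch.version.hip is not None

# Assumed shapes: [batch, seq_len, num_heads, head_dim], fp16 on GPU.
q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")

op = xops.fmha.MemoryEfficientAttentionFlashAttentionOp[0] if is_hip() else None
out = xops.memory_efficient_attention_forward(q, k, v, op=op)
print(out.shape)  # same layout as the inputs
```

So, if I read it right, on non-HIP devices the snippet simply falls back to xformers' automatic kernel selection rather than disabling flash attention outright.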
Very good work! May I ask if it can be merged into the main branch soon?
> I think this is not that TGI is better, but that vLLM results are somewhat misaligned with Hugging Face's transformers.
>
> Not sure if it's a bug or a feature, but...
I seem to have the same issue; I've been waiting for about ten minutes and it's still the same. Judging from the returned error, it shouldn't be a problem on my end, right?...
On GitHub, it looks like this:
Hmm, okay. On Linux you need to first `mv` it to a .7z extension, and then you can extract it with `7z x`.
I've tried installing flash-attn with `pip install flash-attn==2.2.1` and `flash-attn==2.3`. The installation ultimately succeeded. However, when I attempt distributed training with Megatron-LM, I...
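For what it's worth, a quick sanity check I run after installing, a sketch assuming flash-attn 2.x on a CUDA GPU (the shapes and dtype below are arbitrary choices of mine, not from Megatron-LM):

```python
# Verify that the installed flash-attn wheel imports and its kernel actually runs.
import torch
import flash_attn
from flash_attn import flash_attn_func

print("flash-attn version:", flash_attn.__version__)

# flash_attn_func expects (batch, seq_len, num_heads, head_dim) fp16/bf16 tensors on GPU.
q = torch.randn(2, 256, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 256, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 256, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)
print("output shape:", out.shape)  # expected: (2, 256, 8, 64)
```

In my experience, if this passes but Megatron-LM still fails, the mismatch is usually between the PyTorch/CUDA version the wheel was built against and the one the training environment actually uses.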