Muhammad Daniyal

9 comments by Muhammad Daniyal

Yeah, same error with OpenAIChat and `zero-shot-description`.

Hi, same question. What exactly is `model_draft_name`? If I would like to speed up [llama 7b](https://huggingface.co/meta-llama/Llama-2-7b), and I have stored the checkpoints locally, say in a folder named...

@andreas-solti thanks for the response. I watched the video and understood that it is basically a smaller model that should be relatively fast, so the actual model only has to...
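For anyone else landing here: the idea the comment above describes can be sketched in plain Python. This is a toy greedy variant with stand-in `draft_model` / `target_model` stubs (hypothetical names, not the vLLM API) — the cheap draft proposes several tokens, and the expensive target only has to verify them, accepting the matching prefix and correcting the first mismatch.

```python
# Toy sketch of (greedy) speculative decoding with stand-in "models".
# draft_model / target_model are hypothetical stubs, not real vLLM calls.

def draft_model(prefix, k):
    # Cheap proposer: suggests up to k next tokens (here a canned guess).
    canned = [1, 2, 3, 9, 9]
    return canned[:k]

def target_model(prefix):
    # Expensive model: returns its greedy next token for a given prefix.
    canned = [1, 2, 3, 4, 5]
    return canned[len(prefix)] if len(prefix) < len(canned) else None

def speculative_step(prefix, k=4):
    proposal = draft_model(prefix, k)
    accepted = []
    for tok in proposal:
        expected = target_model(prefix + accepted)
        if expected == tok:
            accepted.append(tok)       # draft guessed right: token is "free"
        else:
            accepted.append(expected)  # mismatch: keep the target's token, stop
            break
    return accepted

print(speculative_step([]))  # draft matches 1, 2, 3; target corrects the 4th
```

The target model is still consulted once per accepted token here; the real speedup comes from verifying the whole proposed chunk in a single batched forward pass instead of one pass per token.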

@parthsarthi03 I'm facing the same issue. Any update on this? @chenyujiang11 @ExtReMLapin were you able to resolve this?

Thanks for the response, got it.

I got it to work using the `AsyncLLMEngine` class.

```
from vllm import AsyncLLMEngine, AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model=model,
    quantization=quant,
    enforce_eager=enforce_eager,
    tensor_parallel_size=tensor_parallel_size,
    enable_relay_attention=enable_relay_attention,
    sys_prompt=sys_prompt,
    sys_schema=sys_schema,
    sys_prompt_file=sys_prompt_file,
    sys_schema_file=sys_schema_file,
)
self.engine = ...
```

Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute let's say `prompts = prompts * 32` and then...
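To be precise about what `prompts = prompts * 32` does: it is plain Python list repetition, so the engine receives 32 interleaved copies of each original prompt, not 32 distinct requests. A minimal illustration (the prompt strings are made up):

```python
# List repetition duplicates the same items; order is interleaved copies.
prompts = ["What is relay attention?", "Summarize the paper."]
batched = prompts * 3

print(len(batched))                             # 6
print(batched[0] == batched[2] == batched[4])   # True: identical prompts
```

This matters for benchmarking: identical prompts may not exercise the engine the same way a batch of distinct requests would.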

> > Got it, thank you @rayleizhu for the response. I'll test it with many prompts. However, one concern arises: when I execute let's say `prompts = prompts * 32`...

Hi @rayleizhu, I tried measuring speed with both `enable_relay_attention = True` and `enable_relay_attention = False`, and I can see the difference. But I want to understand: is this the correct way...