PainlessInferenceAcceleration
Here `lookahead_generation` doesn't take `logits_warper` as input: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L426-L439 whereas `logits_warper` is used in the original `sample` to modify `next_tokens_scores`: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L474-L486 and to modify the logits by temperature, top_k, top_p... ```python if generation_config.temperature is...
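For reference, a minimal sketch (based on the upstream Hugging Face `sample` loop, not this repo's code) of how `logits_warper` reshapes the next-token scores before sampling; the prompt ids, vocab size, and sampling parameters below are made up:

```python
import torch
from transformers import (
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

# the warpers that `generation_config.temperature / top_k / top_p` translate into
logits_warper = LogitsProcessorList([
    TemperatureLogitsWarper(0.7),   # scale logits by 1/temperature
    TopKLogitsWarper(top_k=50),     # keep only the 50 highest-scoring tokens
    TopPLogitsWarper(top_p=0.9),    # nucleus filtering
])

input_ids = torch.tensor([[1, 2, 3]])       # hypothetical prompt ids
next_token_logits = torch.randn(1, 32000)   # hypothetical vocab logits

# `sample` warps the scores and then draws from the resulting distribution;
# `lookahead_generation` would need an equivalent step for sampling to match.
next_token_scores = logits_warper(input_ids, next_token_logits)
probs = torch.nn.functional.softmax(next_token_scores, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)
```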
After enabling repetition_penalty, will it lower lookahead's probability? If so, is there any way to avoid the conflict?
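To illustrate the concern, a hedged sketch (using Hugging Face's `RepetitionPenaltyLogitsProcessor`, not this repo's code) of how repetition_penalty down-weights tokens that already appear in `input_ids`, which would also affect drafted tokens that repeat earlier context; the ids and penalty value are made up:

```python
import torch
from transformers import RepetitionPenaltyLogitsProcessor

penalizer = RepetitionPenaltyLogitsProcessor(penalty=1.3)

input_ids = torch.tensor([[5, 7, 7, 9]])   # hypothetical generated ids
scores = torch.randn(1, 100)               # hypothetical vocab scores

# positive scores of already-seen tokens are divided by the penalty, negative
# ones multiplied, so repeated tokens become less likely to be sampled/accepted
penalized = penalizer(input_ids, scores.clone())
print(scores[0, 7], penalized[0, 7])
```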
AntRAG
Hey, I have been looking for the AntRAG dataset that you are using in your paper but could not find it anywhere. Could you provide me with a link to...
In the benchmark comparison results, could we add a comparison with VLLM to see the acceleration effects?
First, I would like to thank you so much for your contribution to the literature. I wanted to ask how token verification is implemented in your code, since it remains...
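My reading of how verification usually works in a greedy lookahead setting, sketched below as an assumption rather than the repo's actual implementation: run one forward pass over the drafted branch and accept the longest prefix whose tokens match the model's own argmax predictions.

```python
import torch

def verify_draft(draft_tokens: torch.Tensor, draft_logits: torch.Tensor):
    """draft_tokens: [k] candidate ids; draft_logits: [k, vocab] logits at the
    positions that predict each candidate (one forward pass over the branch)."""
    predicted = draft_logits.argmax(dim=-1)                # greedy choice per position
    matches = (predicted == draft_tokens).long()
    accepted = int(matches.cumprod(dim=0).sum().item())    # length of the matching prefix
    # the first mismatching position still yields one correct "bonus" token
    next_token = predicted[accepted] if accepted < len(draft_tokens) else None
    return draft_tokens[:accepted], next_token
```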
I read with great interest your paper 'Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy'. In essence, the paper proposes a tree data structure to...
I attempted to swap in FlashAttention for batched llama by simply changing `self._attn()` to `self._sdp_attn()` inside `LlamaAttention.forward()`: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L372-L375 https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L404-L407 where `_sdp_attn` is defined as: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L327-L329 However, the model generates wrong results....
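A hedged guess at the cause: lookahead decoding attends over a drafted branch with a custom (tree-shaped) mask, so an SDPA call that relies on a plain causal pattern would attend incorrectly. A minimal sketch of passing the explicit mask to `torch.nn.functional.scaled_dot_product_attention`; the function name and mask convention below are illustrative, not the repo's definitions:

```python
import torch
import torch.nn.functional as F

def sdp_attn_with_mask(query, key, value, attention_mask):
    # attention_mask: additive float mask (0 for visible positions, -inf for
    # hidden ones), broadcastable to [batch, heads, q_len, kv_len]; passing it
    # preserves the lookahead branch structure instead of assuming causality.
    return F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        is_causal=False,
    )
```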
Is the Qwen 1.5 model supported yet?
I wanted to ask if there's a way to count how many forward passes/steps are done when using PAIN, to contrast it with standard decoding.
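I'm not aware of a built-in counter for this; a minimal workaround sketch, assuming a Hugging Face-style `generate` API, is to register a forward hook on the top-level module and count calls (the function and variable names below are hypothetical):

```python
import torch

def count_forward_passes(model, **generate_kwargs):
    """Count top-level forward calls made during a single `generate` call.
    `model` is any causal LM (e.g. the lookahead-patched one)."""
    calls = {"n": 0}

    def hook(module, args, output):
        calls["n"] += 1

    handle = model.register_forward_hook(hook)
    try:
        output = model.generate(**generate_kwargs)
    finally:
        handle.remove()   # always detach the hook, even if generation fails
    return output, calls["n"]

# Usage (hypothetical): out, steps = count_forward_passes(model, input_ids=ids, max_new_tokens=64)
```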
Consider providing official CodeLlama inference speed-up support.