PainlessInferenceAcceleration
Here `lookahead_generation` doesn't take `logits_warper` as input: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L426-L439 whereas `logits_warper` is used in the original `sample` to modify `next_tokens_scores`: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L474-L486 and to modify the logits by temperature, top_k, top_p... ```python if generation_config.temperature is...
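For reference, a minimal sketch (based on the upstream Hugging Face `sample` loop, not this repo's code) of how `logits_warper` reshapes the next-token scores before sampling; the prompt ids, vocab size, and sampling parameters below are made up:

```python
import torch
from transformers import (
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

# the warpers that `generation_config.temperature / top_k / top_p` translate into
logits_warper = LogitsProcessorList([
    TemperatureLogitsWarper(0.7),   # scale logits by 1/temperature
    TopKLogitsWarper(top_k=50),     # keep only the 50 highest-scoring tokens
    TopPLogitsWarper(top_p=0.9),    # nucleus filtering
])

input_ids = torch.tensor([[1, 2, 3]])       # hypothetical prompt ids
next_token_logits = torch.randn(1, 32000)   # hypothetical vocab logits

# `sample` warps the scores and then draws from the resulting distribution;
# `lookahead_generation` would need an equivalent step for sampling to match.
next_token_scores = logits_warper(input_ids, next_token_logits)
probs = torch.nn.functional.softmax(next_token_scores, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1)
```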
After enabling repetition_penalty, will it lower lookahead's probability? If so, is there any way to avoid the conflict?
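To illustrate the concern, a hedged sketch (using Hugging Face's `RepetitionPenaltyLogitsProcessor`, not this repo's code) of how repetition_penalty down-weights tokens that already appear in `input_ids`, which would also affect drafted tokens that repeat earlier context; the ids and penalty value are made up:

```python
import torch
from transformers import RepetitionPenaltyLogitsProcessor

penalizer = RepetitionPenaltyLogitsProcessor(penalty=1.3)

input_ids = torch.tensor([[5, 7, 7, 9]])   # hypothetical generated ids
scores = torch.randn(1, 100)               # hypothetical vocab scores

# positive scores of already-seen tokens are divided by the penalty, negative
# ones multiplied, so repeated tokens become less likely to be sampled/accepted
penalized = penalizer(input_ids, scores.clone())
print(scores[0, 7], penalized[0, 7])
```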
AntRAG
Hey, I have been looking for the AntRAG dataset that you are using in your paper but could not find it anywhere. Could you provide me with a link to...
In the benchmark comparison results, could we add a comparison with VLLM to see the acceleration effects?
First, I would like to thank you so much for your contribution to the literature. I wanted to ask how token verification is implemented in your code, since it remains...
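My reading of how verification usually works in a greedy lookahead setting, sketched below as an assumption rather than the repo's actual implementation: run one forward pass over the drafted branch and accept the longest prefix whose tokens match the model's own argmax predictions.

```python
import torch

def verify_draft(draft_tokens: torch.Tensor, draft_logits: torch.Tensor):
    """draft_tokens: [k] candidate ids; draft_logits: [k, vocab] logits at the
    positions that predict each candidate (one forward pass over the branch)."""
    predicted = draft_logits.argmax(dim=-1)                # greedy choice per position
    matches = (predicted == draft_tokens).long()
    accepted = int(matches.cumprod(dim=0).sum().item())    # length of the matching prefix
    # the first mismatching position still yields one correct "bonus" token
    next_token = predicted[accepted] if accepted < len(draft_tokens) else None
    return draft_tokens[:accepted], next_token
```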
I read with great interest your paper 'Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy'. In essence, the paper proposes a tree data structure to...
I attempted to swap in FlashAttention for batched llama by simply changing `self._attn()` to `self._sdp_attn()` inside `LlamaAttention.forward()`: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L372-L375 https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L404-L407 where `_sdp_attn` is defined as: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L327-L329 However, the model generates wrong results....
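A hedged guess at the cause: lookahead decoding attends over a drafted branch with a custom (tree-shaped) mask, so an SDPA call that relies on a plain causal pattern would attend incorrectly. A minimal sketch of passing the explicit mask to `torch.nn.functional.scaled_dot_product_attention`; the function name and mask convention below are illustrative, not the repo's definitions:

```python
import torch
import torch.nn.functional as F

def sdp_attn_with_mask(query, key, value, attention_mask):
    # attention_mask: additive float mask (0 for visible positions, -inf for
    # hidden ones), broadcastable to [batch, heads, q_len, kv_len]; passing it
    # preserves the lookahead branch structure instead of assuming causality.
    return F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        is_causal=False,
    )
```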
Is the Qwen 1.5 model supported yet?
I wanted to ask if there's a way to count how many forward passes/steps are done when using PAIN, to contrast it with standard decoding.
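I'm not aware of a built-in counter for this; a minimal workaround sketch, assuming a Hugging Face-style `generate` API, is to register a forward hook on the top-level module and count calls (the function and variable names below are hypothetical):

```python
import torch

def count_forward_passes(model, **generate_kwargs):
    """Count top-level forward calls made during a single `generate` call.
    `model` is any causal LM (e.g. the lookahead-patched one)."""
    calls = {"n": 0}

    def hook(module, args, output):
        calls["n"] += 1

    handle = model.register_forward_hook(hook)
    try:
        output = model.generate(**generate_kwargs)
    finally:
        handle.remove()   # always detach the hook, even if generation fails
    return output, calls["n"]

# Usage (hypothetical): out, steps = count_forward_passes(model, input_ids=ids, max_new_tokens=64)
```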
Consider providing official CodeLlama inference speed-up support.