Yao Zhao
Thank you, we will fix it soon.
To Q1: The draft tokens are generated from a cached trie tree (each node is a token id). Currently, the trie tree is constructed on-the-fly from prompts and responses, therefore...
We choose tokens not only from responses, but also from prompts.
In the benchmark, we first generate responses for samples from the dev set and put those responses into a global trie tree; then we evaluate each prompt in the test set(...
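As a rough sketch of that flow (hypothetical class and variable names, not the repo's actual implementation): a trie over token ids is filled with dev-set responses, and drafts are retrieved by matching a prefix while decoding test-set prompts.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token_id -> TrieNode


class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        # Add one token-id sequence (a prompt or a generated response).
        node = self.root
        for tid in token_ids:
            node = node.children.setdefault(tid, TrieNode())

    def draft(self, prefix_ids, max_len=8):
        # Walk the trie along the current prefix, then follow one branch to
        # collect draft tokens (the real method returns a sub-tree of
        # candidate branches rather than a single path).
        node = self.root
        for tid in prefix_ids:
            if tid not in node.children:
                return []
            node = node.children[tid]
        draft = []
        while node.children and len(draft) < max_len:
            tid, node = next(iter(node.children.items()))
            draft.append(tid)
        return draft


trie = TokenTrie()
trie.insert([11, 12, 13, 14])   # dev-set response, as token ids
trie.insert([11, 12, 15])
print(trie.draft([11, 12]))     # -> [13, 14], one candidate continuation
```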
Lines from [here](https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/common/pretrained_model.py#L807) to [here](https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/common/pretrained_model.py#L861) are used for verification of tree drafts. Our `lookahead` decoding cannot generate exactly the same response as the generation mode `SAMPLE` in transformers, due...
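For intuition, a simplified verification over a single draft branch might look like the sketch below (hypothetical helper name and greedy acceptance only; the linked code verifies a whole tree of branches):

```python
import torch


def verify_branch(logits, draft_ids):
    """Simplified single-branch verification sketch (hypothetical helper).

    `logits` has shape [len(draft_ids) + 1, vocab]: one position per draft
    token, plus the position predicting the token after the last draft token.
    """
    predicted = logits.argmax(dim=-1)        # greedy choices of the target model
    accepted = []
    for i, tid in enumerate(draft_ids):
        if predicted[i].item() != tid:       # stop at the first mismatch
            break
        accepted.append(tid)
    # The token predicted after the last accepted draft token is always kept,
    # so each forward pass yields at least one new token.
    next_token = predicted[len(accepted)].item()
    return accepted, next_token
```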
AntRAG is a proprietary RAG dataset utilized by our company, comprising queries and relevant documents retrieved from our app. Unfortunately, it cannot be shared externally due to privacy concerns.
It may indeed lower the speedup by about 5%-10%; a sufficient warmup can mitigate the negative effect.
You can count the steps in two ways. One is to turn on `debug_lookahead`, which outputs debug info for each step so you can count the steps manually; the...
It should be `sum(kwargs['dls']) - len(kwargs['dls'])`, because the decoding length (i.e., `dls`) is composed of the next token and the draft tokens, so we should subtract 1 for each step.
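For example (hypothetical numbers, just to illustrate the formula above):

```python
dls = [5, 3, 4]                # each entry = 1 next token + that step's draft tokens
steps = len(dls)               # 3 decoding steps
result = sum(dls) - len(dls)   # 12 - 3 = 9 (one next token subtracted per step)
print(steps, result)
```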