Yao Zhao
Thank you, we will fix it soon.
To Q1: The draft tokens are generated from a cached trie tree (each node is a token id). Currently, the trie tree is constructed on-the-fly from prompts and responses, therefore...
We choose tokens not only from responses, but also from prompts.
In the benchmark, we first generate responses for samples from the dev set and put those responses into a global trie tree; then we evaluate each prompt in the test set(...
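As a rough sketch of that flow (hypothetical class and variable names, not the repo's actual implementation): a trie over token ids is filled with dev-set responses, and drafts are retrieved by matching a prefix while decoding test-set prompts.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token_id -> TrieNode


class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        # Add one token-id sequence (a prompt or a generated response).
        node = self.root
        for tid in token_ids:
            node = node.children.setdefault(tid, TrieNode())

    def draft(self, prefix_ids, max_len=8):
        # Walk the trie along the current prefix, then follow one branch to
        # collect draft tokens (the real method returns a sub-tree of
        # candidate branches rather than a single path).
        node = self.root
        for tid in prefix_ids:
            if tid not in node.children:
                return []
            node = node.children[tid]
        draft = []
        while node.children and len(draft) < max_len:
            tid, node = next(iter(node.children.items()))
            draft.append(tid)
        return draft


trie = TokenTrie()
trie.insert([11, 12, 13, 14])   # dev-set response, as token ids
trie.insert([11, 12, 15])
print(trie.draft([11, 12]))     # -> [13, 14], one candidate continuation
```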
Lines from [here](https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/common/pretrained_model.py#L807) to [here](https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/common/pretrained_model.py#L861) are used for verification of tree drafts. Our `lookahead` decoding cannot generate exactly the same response as the generation mode `SAMPLE` in transformers, due...
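For intuition, a simplified verification over a single draft branch might look like the sketch below (hypothetical helper name and greedy acceptance only; the linked code verifies a whole tree of branches):

```python
import torch


def verify_branch(logits, draft_ids):
    """Simplified single-branch verification sketch (hypothetical helper).

    `logits` has shape [len(draft_ids) + 1, vocab]: one position per draft
    token, plus the position predicting the token after the last draft token.
    """
    predicted = logits.argmax(dim=-1)        # greedy choices of the target model
    accepted = []
    for i, tid in enumerate(draft_ids):
        if predicted[i].item() != tid:       # stop at the first mismatch
            break
        accepted.append(tid)
    # The token predicted after the last accepted draft token is always kept,
    # so each forward pass yields at least one new token.
    next_token = predicted[len(accepted)].item()
    return accepted, next_token
```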
AntRAG is a proprietary RAG dataset utilized by our company, comprising queries and relevant documents retrieved from our app. Unfortunately, it cannot be shared externally due to privacy concerns.
It may indeed lower the speedup by about 5%-10%; a sufficient warmup can mitigate the negative effect.
You can count the steps in two ways. One is to turn on `debug_lookahead`, which outputs debug info for each step so you can count the steps manually; the...
It should be `sum(kwargs['dls']) - len(kwargs['dls'])`, because the decoding length (i.e., `dls`) is composed of the next token and the draft tokens, so we should subtract 1 for each step.
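For example (hypothetical numbers, just to illustrate the formula above):

```python
dls = [5, 3, 4]                # each entry = 1 next token + that step's draft tokens
steps = len(dls)               # 3 decoding steps
result = sum(dls) - len(dls)   # 12 - 3 = 9 (one next token subtracted per step)
print(steps, result)
```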