
Generate: Add assisted generation

Open gante opened this issue 2 years ago • 1 comment

What does this PR do?

WIP

gante avatar Mar 16 '23 18:03 gante

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts @sgugger -- since this PR is a bit more complex than most, I've decided to request a review from you two 🤗

gante avatar Apr 17 '23 18:04 gante

@amyeroberts regarding splitting up, I totally agree! And not only on this method but on most parts of GenerationMixin. Not only are the functions long, but they reuse a significant part of the logic. I want to address that in the near future, by designing a .generate() that can be somehow composed of a sequence of smaller functional blocks. I haven't figured out the deets, but I'd expect that a good implementation would get us better readability, less code duplication, and higher flexibility for HW/model/decoding-specific implementations! 💅

Before merging, I'm going to double-check that the current code keeps the performance numbers I got a few weeks ago. If everything goes well, it will be merged today 🙏

gante avatar Apr 18 '23 15:04 gante

@gante Excellent work! I dove into the code recently and found that the implementation only supports batch size 1. Speculative Decoding itself has no dependence on batch size. I guess supporting bs > 1 is harder to implement, so you started with bs = 1?

Another question is about the decision of whether the candidate tokens generated by the draft model are accepted or not. The way n_matches is computed is not the same as in Google's or DeepMind's paper. I found an implementation of DeepMind's algorithm. Could you please explain it in more detail? Thanks in advance.

zhaoyang-star avatar Oct 31 '23 11:10 zhaoyang-star

@zhaoyang-star thank you for the kind words :)

Re batch size 1: it was a mix of implementation simplicity and diminishing returns. Since transformers works with batched inputs of fixed length, efficiently applying assisted generation/speculative decoding would necessarily mean applying extra logic to realign the tensors (e.g. row 1 might get 5 speculated tokens, but row 2 only gets 2 -- row 2 would need to be left-padded to continue). Moving to nested tensors will rid us of this limitation :)
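
For illustration, here is a minimal sketch of that realignment problem (not the library's code; the token ids and acceptance counts are made up):

import torch

pad_id = 0
# Hypothetical batch of two rows after one speculation step:
# row 0 had 5 candidate tokens accepted, row 1 only 2.
accepted = [torch.tensor([11, 12, 13, 14, 15]),  # row 0
            torch.tensor([21, 22])]              # row 1

# To keep a rectangular tensor for the next forward pass, the shorter row
# must be left-padded so the newest tokens of every row stay right-aligned.
max_len = max(len(t) for t in accepted)
realigned = torch.stack([
    torch.cat([torch.full((max_len - len(t),), pad_id), t]) for t in accepted
])
print(realigned)
# tensor([[11, 12, 13, 14, 15],
#         [ 0,  0,  0, 21, 22]])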

Re implementation differences: the two techniques were developed independently, despite relying on the same principle (saving GPU memory bandwidth with the aid of a smaller model). To put it plainly:

  1. Speculative Decoding is better when sampling is active with temperatures above 0.3-0.4 -- it employs a clever mathematical trick to handle decoding mismatches. However, you must define how many tokens you want to fetch from the smaller model.
  2. Assisted Generation (our implementation) is better in the other scenarios because it has a dynamic heuristic to decide how many tokens to fetch from the assistant model, based on the assistant hit ratio. This means it can adapt according to the difficulty of the prompt, with no additional user input (a simplified sketch of such a heuristic follows below).
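
A rough sketch of that kind of heuristic, simplified and not the exact library code: grow the candidate budget when every speculated token is accepted, and shrink it otherwise.

def update_candidate_budget(num_candidates: int, n_matches: int) -> int:
    """Simplified sketch of a dynamic candidate-length heuristic.

    If all speculated tokens were accepted, speculate more aggressively on
    the next step; otherwise back off, never dropping below 1 candidate.
    """
    if n_matches == num_candidates:
        return num_candidates + 2
    return max(1, num_candidates - 1)

# Example: start with 5 candidates per step; acceptance counts are made up.
budget = 5
for n_matches in [5, 5, 2, 0, 3]:
    budget = update_candidate_budget(budget, n_matches)
    print(budget)  # 7, 6, 5, 4, 3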

For the record, we will be adding the sampling trick to our implementation soon, so it will be the best of both worlds :)

gante avatar Oct 31 '23 15:10 gante

@gante Thanks for your reply.

Speculative Decoding is better when sampling is active with temperatures above 0.3-0.4 -- it employs a clever mathematical trick to handle decoding mismatches. However, you must define how many tokens you want to fetch from the smaller model.

How did you reach the conclusion that Speculative Decoding is better when sampling is active with temperatures above 0.3-0.4, while Assisted Generation is better in other scenarios? If the conclusion is right, would it be better to implement both methods and pick which one to execute according to the value of temperature?

BTW, Assisted Generation is much easier to understand than Speculative Decoding, so I prefer to use Assisted Generation.

zhaoyang-star avatar Nov 15 '23 01:11 zhaoyang-star

@zhaoyang-star The conclusion is empirical, with the 0.3-0.4 being a personal rule of thumb based on my assisted generation tests and the values reported in the speculative decoding paper 🤗 It certainly depends on the model and on the task itself.

After we merge the mathematical trick from speculative decoding, calling assisted_generation will actually be the best of both worlds -- it will use the mathematical trick from speculative decoding AND apply the heuristic to determine the number of candidate tokens from assisted generation, all without additional parameterization!
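
For reference, the mathematical trick in question is the rejection-sampling rule from the speculative decoding papers: accept a candidate token with probability min(1, p/q), and resample from the normalized residual max(0, p - q) on rejection. A minimal single-token sketch, not the transformers implementation, using made-up probability vectors:

import torch

def accept_or_resample(p: torch.Tensor, q: torch.Tensor, candidate: int) -> int:
    """Speculative sampling rule for a single candidate token.

    p: target-model probabilities over the vocabulary.
    q: draft-model probabilities over the vocabulary.
    candidate: token id that the draft model sampled from q.
    """
    if torch.rand(1).item() < min(1.0, (p[candidate] / q[candidate]).item()):
        return candidate  # accepted: keeps the target distribution intact
    residual = torch.clamp(p - q, min=0.0)  # resample on rejection
    return torch.multinomial(residual / residual.sum(), num_samples=1).item()

# Hypothetical 4-token vocabulary
p = torch.tensor([0.1, 0.6, 0.2, 0.1])  # target model
q = torch.tensor([0.4, 0.3, 0.2, 0.1])  # draft model
print(accept_or_resample(p, q, candidate=0))

The net effect is that the accepted/resampled tokens follow the target model's sampling distribution exactly, which is why no temperature-based switching between the two methods is needed.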

gante avatar Nov 15 '23 15:11 gante

@gante Thanks a lot. Can't wait to try the merged version. I saw https://github.com/huggingface/transformers/pull/27270/ is related to speculative decoding.

zhaoyang-star avatar Nov 16 '23 03:11 zhaoyang-star

@gante Have you thought of a solution or approach to implement assisted generation on transformers-neuronx?

Dev-hestabit avatar Nov 23 '23 06:11 Dev-hestabit

Thanks @gante for the feature!

I was trying out the following snippet and couldn't figure out which model pairs are supported by the feature, and I have a couple of questions on how to use it.

  1. What model-pairings are known to be supported by the model.generate(..., assistant_model='') feature?
  2. Does it work for decoder-only model too? Anyone tried any pairs of decoder-only models available on the huggingface hub?

I suppose the assumptions are that

  • the tokenizer must be the same for assistant and main model
  • the model is supported by AutoModelForCausalLM

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = 'EleutherAI/pythia-1.4b-deduped'
assistant = 'EleutherAI/pythia-160m-deduped'

tokenizer = AutoTokenizer.from_pretrained(checkpoint) #, bos_token_id=101, eos_token_id=102)
model = AutoModelForCausalLM.from_pretrained(checkpoint) #, bos_token_id=101, eos_token_id=102)

assistant_model = AutoModelForCausalLM.from_pretrained(assistant)

tokenized_inputs = tokenizer("Alice and Bob", return_tensors="pt")

outputs = model.generate(**tokenized_inputs, assistant_model=assistant_model)

tokenizer.batch_decode(outputs, skip_special_tokens=True)

What I've tried

This works:

  • EleutherAI/pythia-1.4b-deduped + EleutherAI/pythia-160m-deduped

These didn't:

  • google-bert/bert-large-uncased + google-bert/bert-base-uncased (I also had to add bos_token_id=101, eos_token_id=102 to the model and/or tokenizer initialization to avoid a None type when the assistant model is scoping down the vocabulary)
  • FacebookAI/xlm-roberta-large + FacebookAI/xlm-roberta-base (ended up with a TypeError: object of type 'NoneType' has no len() error during candidate generation)

alvations avatar Mar 13 '24 19:03 alvations

@alvations 👋

It also works with encoder-decoder models, i.e. models supported by AutoModelForSeq2SeqLM. I am definitely unable to list all working cases, but feel free to open a new issue if you find a pair that you think should work and doesn't :)
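
For instance, an encoder-decoder pairing follows the same pattern as the snippet above (this exact checkpoint pairing is only illustrative, not something I have benchmarked):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative encoder-decoder pairing that shares a tokenizer/vocabulary.
checkpoint = 't5-large'
assistant = 't5-small'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

assistant_model = AutoModelForSeq2SeqLM.from_pretrained(assistant)

tokenized_inputs = tokenizer("translate English to German: Alice and Bob", return_tensors="pt")

outputs = model.generate(**tokenized_inputs, assistant_model=assistant_model)

tokenizer.batch_decode(outputs, skip_special_tokens=True)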

gante avatar Mar 14 '24 19:03 gante