Improve warmup checking for max new tokens when using speculative decoding
If speculative decoding is in use and the user wants to generate up to the model's maximum position embeddings, errors can arise at runtime that trigger a CUDA device-side assert. We should do a better job of detecting this case during warmup, or gracefully handle the edge case on a per-request basis.
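
For the per-request option, one possible shape is a validation step that budgets for the extra draft tokens before decoding starts, presumably because those speculative tokens can push the sequence length past the position embedding limit. The sketch below is only illustrative; the function and parameter names (`validate_generation_length`, `speculative_tokens`, etc.) are hypothetical and do not correspond to LoRAX's actual request-validation code.

```python
# Hypothetical per-request guard; not LoRAX's real validation path.
def validate_generation_length(
    input_length: int,
    max_new_tokens: int,
    speculative_tokens: int,
    max_position_embeddings: int,
) -> int:
    """Clamp max_new_tokens so the sequence, including any speculative
    (draft) tokens appended during decoding, never indexes past the
    model's position embedding table."""
    # Reserve room for the draft tokens that speculative decoding may
    # append in a single step before verification.
    budget = max_position_embeddings - input_length - speculative_tokens
    if budget <= 0:
        raise ValueError(
            f"Prompt of length {input_length} leaves no room to generate "
            f"with {speculative_tokens} speculative tokens and a context "
            f"limit of {max_position_embeddings} positions."
        )
    # Alternative: raise here instead of clamping, so the client gets a
    # clear validation error rather than a silently shortened response.
    return min(max_new_tokens, budget)
```

Whether to clamp or reject is a design choice: clamping keeps existing clients working but silently shortens output, while rejecting surfaces the problem explicitly. A warmup-time check could complement either by failing fast when the configured defaults can never fit within the model's context.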