Improve warmup checking for max new tokens when using speculative decoding

Open · tgaddair opened this issue 1 year ago · 1 comment

If speculative decoding is in use and a request asks to generate up to the model's max positional embeddings, generation can fail at runtime with a CUDA device-side assert error. We should do a better job of detecting this during warmup, or handle the edge case gracefully per request.
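
A rough sketch of what such a check could look like, to make the idea concrete. This is not LoRAX code: `LengthBudget`, `speculative_tokens`, `validate`, and `clamp` are hypothetical names, and the sketch assumes the failure comes from draft tokens occupying positions past `max_position_embeddings`, so it reserves one slot per speculative token. The real headroom needed may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthBudget:
    """Hypothetical helper for illustration; not part of LoRAX."""

    max_position_embeddings: int  # taken from the model config
    speculative_tokens: int = 0   # assumed: draft tokens proposed per decode step

    @property
    def max_total_tokens(self) -> int:
        # Reserve room for draft positions so that no position id
        # can index past the end of the positional embedding table.
        return self.max_position_embeddings - self.speculative_tokens

    def validate(self, input_length: int, max_new_tokens: int) -> None:
        """Reject a request (or a warmup config) that could overflow."""
        total = input_length + max_new_tokens
        if total > self.max_total_tokens:
            raise ValueError(
                f"input_length + max_new_tokens = {total} exceeds "
                f"{self.max_total_tokens} (max_position_embeddings="
                f"{self.max_position_embeddings}, speculative_tokens="
                f"{self.speculative_tokens})"
            )

    def clamp(self, input_length: int, max_new_tokens: int) -> int:
        """Gracefully shrink max_new_tokens instead of failing."""
        return max(0, min(max_new_tokens, self.max_total_tokens - input_length))
```

`validate` corresponds to catching the problem up front (at warmup or request validation), while `clamp` corresponds to handling the edge case per request by trimming the generation length rather than raising.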

tgaddair · May 17 '24 22:05