text-generation-inference
feat: tokenize each request individually and increase warmup image size
This PR resolves some small issues with qwen2-vl.
- doubles the size of `WARMUP_IMAGE_BASE64` from 20x20px to 40x40px (meets Qwen2-VL's minimal requirement without a hacky fix)
- removes the hacky fix that doubled the warmup image
- prefers tokenizing each request individually instead of the whole batch at once. This change allows `r.truncate` to be passed for each request; previously it was not respected when one request was smaller than the others in the batch (see the first sketch after this list).
- sets `max_s` to the max of `max_s` or the input size. This is required so the rotary layer can create `self._cos_cached` of the correct size in relation to the position ids (see the second sketch after this list).
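Below is a minimal sketch of the per-request tokenization idea. It assumes a HuggingFace `AutoTokenizer` and a hypothetical `Request` type carrying `inputs` and `truncate`; it is not the actual TGI code, just an illustration of why per-request truncation differs from truncating the whole batch at once.

```python
from dataclasses import dataclass
from transformers import AutoTokenizer

@dataclass
class Request:
    inputs: str
    truncate: int  # per-request truncation limit, as in r.truncate

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def tokenize_requests(requests):
    """Tokenize each request on its own so its truncate value is honored.

    Tokenizing the whole batch with a single truncation length only respects
    the longest request; a shorter request's truncate would be ignored.
    """
    batch_input_ids = []
    for r in requests:
        encoded = tokenizer(
            r.inputs,
            truncation=True,
            max_length=r.truncate,  # applied per request, not per batch
        )
        batch_input_ids.append(encoded["input_ids"])
    return batch_input_ids
```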
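And a hedged sketch of the `max_s` point, using a generic rotary cos/sin cache rather than TGI's real implementation; the function name and signature are made up for illustration.

```python
import torch

def build_rotary_cache(inv_freq: torch.Tensor, max_s: int, input_length: int):
    """Build the rotary cos/sin cache so it covers the longest position id.

    During warmup, max_s (derived from max-input-tokens) can be smaller than
    the actual warmup input (derived from max-batch-prefill-tokens); taking
    the max of the two keeps position ids from indexing past the cache.
    """
    seqlen = max(max_s, input_length)
    t = torch.arange(seqlen, dtype=inv_freq.dtype, device=inv_freq.device)
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()
```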
These changes resolve a startup issue reproducible with:
text-generation-launcher \
--model-id Qwen/Qwen2-VL-2B-Instruct \
--max-input-tokens 40 \
--max-batch-prefill-tokens 50 \
--max-total-tokens 51
*(note: the underlying issue triggers when `max-input-tokens` is less than `max-batch-prefill-tokens`)*
> from 20x20px to 40x40px (meets Qwen2-VL's minimal requirement without a hacky fix)
I do not understand why we should impose anything on the user for the images. If 20x20px is not supported we should either:
- Rescale the image seamlessly and correctly infer on it (a sketch of this option follows the list), or
- Reject the image with a proper error message.

Users shouldn't have to know anything about the model's internals; 20x20px should be OK imho.
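For the first option, here is a minimal, hypothetical sketch of upscaling undersized images with Pillow; `MIN_SIDE` is an assumed value (matching the 40px warmup image), not a constant taken from the model or TGI.

```python
from PIL import Image

# Hypothetical minimum side length; the real value depends on the model's
# image patching (Qwen2-VL rejects very small images).
MIN_SIDE = 40

def ensure_min_size(image: Image.Image, min_side: int = MIN_SIDE) -> Image.Image:
    """Upscale tiny images instead of rejecting them or asking the user to resize."""
    width, height = image.size
    if width >= min_side and height >= min_side:
        return image
    scale = max(min_side / width, min_side / height)
    new_size = (
        max(min_side, round(width * scale)),
        max(min_side, round(height * scale)),
    )
    return image.resize(new_size, Image.Resampling.BICUBIC)
```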
Optimistically merging this PR as all tests pass, comments have been addressed, this image has been tested/deployed in production, and it fixes a bug when starting TGI with qwen2-vl.
Will watch for regressions and roll back if needed.