text-generation-inference
feat: tokenize each request individually and increase warmup image size
This PR resolves some small issues with qwen2-vl.
- doubles the size of `WARMUP_IMAGE_BASE64` from 20x20px to 40x40px (meets Qwen2-VL's minimal requirement without a hacky fix)
- removes the hacky fix that doubled the warmup image
- prefers tokenizing each request individually instead of the whole batch at once. This change allows `r.truncate` to be passed for each request; previously it was not respected when one request was smaller than the others in the batch (see the first sketch after this list).
- sets `max_s` to the max of `max_s` or the input size. This is required so the rotary layer can create `self._cos_cached` of the correct size in relation to the position ids (see the second sketch after this list).
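Below is a minimal sketch of the per-request tokenization idea. It assumes a HuggingFace `AutoTokenizer` and a hypothetical `Request` type carrying `inputs` and `truncate`; it is not the actual TGI code, just an illustration of why per-request truncation differs from truncating the whole batch at once.

```python
from dataclasses import dataclass
from transformers import AutoTokenizer

@dataclass
class Request:
    inputs: str
    truncate: int  # per-request truncation limit, as in r.truncate

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

def tokenize_requests(requests):
    """Tokenize each request on its own so its truncate value is honored.

    Tokenizing the whole batch with a single truncation length only respects
    the longest request; a shorter request's truncate would be ignored.
    """
    batch_input_ids = []
    for r in requests:
        encoded = tokenizer(
            r.inputs,
            truncation=True,
            max_length=r.truncate,  # applied per request, not per batch
        )
        batch_input_ids.append(encoded["input_ids"])
    return batch_input_ids
```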
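And a hedged sketch of the `max_s` point, using a generic rotary cos/sin cache rather than TGI's real implementation; the function name and signature are made up for illustration.

```python
import torch

def build_rotary_cache(inv_freq: torch.Tensor, max_s: int, input_length: int):
    """Build the rotary cos/sin cache so it covers the longest position id.

    During warmup, max_s (derived from max-input-tokens) can be smaller than
    the actual warmup input (derived from max-batch-prefill-tokens); taking
    the max of the two keeps position ids from indexing past the cache.
    """
    seqlen = max(max_s, input_length)
    t = torch.arange(seqlen, dtype=inv_freq.dtype, device=inv_freq.device)
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()
```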
These changes resolve a startup issue reproducible with:
text-generation-launcher \
--model-id Qwen/Qwen2-VL-2B-Instruct \
--max-input-tokens 40 \
--max-batch-prefill-tokens 50 \
--max-total-tokens 51
*(note: the underlying issue triggers when `max-input-tokens` is less than `max-batch-prefill-tokens`)*
> from 20x20px to 40x40px (meets Qwen2-VL's minimal requirement without a hacky fix)
I do not understand why we should impose anything on the user for the images. If 20x20px is not supported we should either:
- Rescale the image seamlessly and correctly infer on it (a sketch of this option follows the list), or
- Reject the image with a proper error message.

Users shouldn't have to know anything about the model's internals; 20x20px should be OK imho.
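For the first option, here is a minimal, hypothetical sketch of upscaling undersized images with Pillow; `MIN_SIDE` is an assumed value (matching the 40px warmup image), not a constant taken from the model or TGI.

```python
from PIL import Image

# Hypothetical minimum side length; the real value depends on the model's
# image patching (Qwen2-VL rejects very small images).
MIN_SIDE = 40

def ensure_min_size(image: Image.Image, min_side: int = MIN_SIDE) -> Image.Image:
    """Upscale tiny images instead of rejecting them or asking the user to resize."""
    width, height = image.size
    if width >= min_side and height >= min_side:
        return image
    scale = max(min_side / width, min_side / height)
    new_size = (
        max(min_side, round(width * scale)),
        max(min_side, round(height * scale)),
    )
    return image.resize(new_size, Image.Resampling.BICUBIC)
```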
Optimistically merging this PR as all tests pass, comments have been addressed, this image has been tested/deployed in production, and it fixes a bug when starting TGI with qwen2-vl.
Will watch for regressions and roll back if needed.