RuntimeError in llava image encoding
When running LLaVA 1.6 Mistral 7B, I get this error:
RuntimeError in llava image encoding: The expanded size of the tensor (0) must match the existing size (2438) at non-singleton dimension 0. Target sizes: [0, 4096]. Tensor sizes: [2438, 4096]
torch.Size([2758, 4096])
0 -1
Note that the sizes 2438 and 2758 vary.
This error happens randomly and is not specific to particular input data.
Removing the image input also makes the error go away.
Same bug raised by using gen(choices=[...]) with image input, as mentioned in #222.
This bug is not related to whether gen choices are used; it happens whenever I have an image as input.
> Same bug raised by using gen(choices=[...]) with image input, as mentioned in #222.
That's because, when computing the "logprob_start_len" variable used to locate the start position of the choices, sglang does not add the position shift needed to account for the image tokens inserted into the input sequence.
https://github.com/sgl-project/sglang/blob/main/python/sglang/backend/runtime_endpoint.py#L217
For now, I added a temporary fix to make it work, as follows:
# Compute logprob
data = {
    "text": [s.text_ + c for c in choices],
    "sampling_params": {"max_new_tokens": 0},
    "return_logprob": True,
    "logprob_start_len": max(prompt_len - 1, 0),
    # should be prompt_len-1 here i think, otherwise the normed_logp will be wrong
    "return_text_in_logprobs": True,
}
self._add_images(s, data)
if s.images_:  # only support one image
    # TODO: This is a very naive way to shift the logprob_start_len
    # maybe in future we should directly modify `prompt_tokens` variable
    # to take the added image tokens into account
    data["logprob_start_len"] += 576 - 1
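To illustrate why the shift is `576 - 1`, here is a minimal standalone sketch that does not use sglang at all. It assumes that each image placeholder token in the text is replaced by a fixed number of image feature tokens (576 here, taken from the fix above). The names `IMAGE_TOKEN`, `expand_image_tokens`, and `shifted_start` are hypothetical and exist only for this example. Since the error at the top of this thread reports sizes like 2438 and 2758 for the 1.6 models, a fixed 576 may well not hold there, so treat that number as an assumption.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder name, for illustration only


def expand_image_tokens(tokens, tokens_per_image=576):
    """Replace each image placeholder with `tokens_per_image` feature slots."""
    out = []
    for t in tokens:
        if t == IMAGE_TOKEN:
            out.extend(["<img_feat>"] * tokens_per_image)
        else:
            out.append(t)
    return out


def shifted_start(start, tokens, tokens_per_image=576):
    """Map an index in the text-only sequence to the expanded sequence.

    Each placeholder before `start` becomes `tokens_per_image` tokens,
    a net gain of `tokens_per_image - 1` positions.
    """
    n_images_before = sum(1 for t in tokens[:start] if t == IMAGE_TOKEN)
    return start + n_images_before * (tokens_per_image - 1)


prompt = ["<s>", "USER:", IMAGE_TOKEN, "Is", "it", "a", "cat?", "ASSISTANT:"]
choice = ["yes"]
tokens = prompt + choice

start = max(len(prompt) - 1, 0)          # same formula as in the fix above
expanded = expand_image_tokens(tokens)
new_start = shifted_start(start, tokens)

# The shifted index points at the same token as the unshifted one did.
assert expanded[new_start] == tokens[start]
print(start, new_start)
```

Without the shift, `start` would land inside the run of image feature slots in the expanded sequence, so the logprobs would be read from the wrong positions.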
I'm also running into the problem. It seems to be only a problem with the newer 1.6 LLaVA models.
Have you found out yet what exactly causes it and how to solve it? I do not use regex, so the gen choices code should not be called.