LLaVA-NeXT
About data processing
Is the grounding data labeled before or after the data processor?
For example, an image is padded so that its width equals its height, and the openai/clip-vit image processor applies do_center_crop. Both operations change where the grounded objects end up in the processed image, but the box coordinates written in the conversations stay the same, so I wonder whether this has an impact on training.
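To illustrate what I mean, here is a minimal sketch (not code from this repo; the helper names and the 336 crop size are my assumptions) of how a pixel-space box would have to be remapped after pad-to-square and a CLIP-style center crop, while the coordinates in the conversation text stay untouched:

```python
# Hypothetical helpers, for illustration only -- not LLaVA-NeXT code.

def pad_to_square_box(box, width, height):
    """Shift a pixel-space box after the image is padded (centered) to a square canvas."""
    x1, y1, x2, y2 = box
    side = max(width, height)
    dx = (side - width) // 2   # left padding
    dy = (side - height) // 2  # top padding
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def center_crop_box(box, width, height, crop):
    """Shift and clamp a box after a center crop to `crop` x `crop` pixels."""
    x1, y1, x2, y2 = box
    dx = (width - crop) // 2
    dy = (height - crop) // 2
    x1, x2 = max(x1 - dx, 0), min(x2 - dx, crop)
    y1, y2 = max(y1 - dy, 0), min(y2 - dy, crop)
    return (x1, y1, x2, y2)

# Example: a 640x480 image padded to 640x640, then center-cropped to 336
# (assuming the 336 crop of clip-vit-large-patch14-336).
box = (100, 50, 300, 200)
box = pad_to_square_box(box, 640, 480)      # -> (100, 130, 300, 280)
box = center_crop_box(box, 640, 640, 336)   # -> (0, 0, 148, 128), partly cut off
print(box)
```

If the conversation still says (100, 50, 300, 200) after these steps, the text no longer points at the object the model actually sees, which is the mismatch I am worried about.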
Hi, may I ask where this grounding dataset comes from? Also, does LLaVA-OneVision have any special tokens for bounding boxes, or any preprocessing function for boxes? I didn't see one in the code.