VisionLLM
About object detection
I think you feed the following tokens into the LLM:
['<cls>', '<x1>', '<y1>', '<x2>', '<y2>', '<cls>', '<x1>', '<y1>', '<x2>', '<y2>', '<cls>', '<x1>', '<y1>', '<x2>', '<y2>', ...]
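For context, this is roughly how I imagine that sequence being built. The quantization of coordinates into `num_bins` discrete location tokens and the token names here are my assumptions for illustration, not code from this repo:

```python
# Sketch: serialize (class, box) pairs into a flat token sequence.
# Assumes coordinates are normalized to [0, 1] and quantized into bins.
def boxes_to_tokens(boxes, labels, num_bins=1000):
    """boxes: iterable of (x1, y1, x2, y2) in [0, 1]; labels: class ids."""
    tokens = []
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        tokens.append(f"<cls_{cls}>")
        for coord in (x1, y1, x2, y2):
            # Quantize each continuous coordinate into a discrete bin token.
            bin_id = min(int(coord * num_bins), num_bins - 1)
            tokens.append(f"<bin_{bin_id}>")
    return tokens

print(boxes_to_tokens([(0.1, 0.2, 0.5, 0.6)], [3]))
# ['<cls_3>', '<bin_100>', '<bin_200>', '<bin_500>', '<bin_600>']
```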
Regarding the object detection loss, did you use Hungarian matching like DETR?
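To make the question concrete, here is a minimal sketch of DETR-style Hungarian matching using SciPy. The cost below is a plain L1 box distance for illustration only; DETR's actual matching cost also includes classification and GIoU terms, and this is not VisionLLM's code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """pred_boxes: (N, 4), gt_boxes: (M, 4) -> matched (pred, gt) index pairs."""
    # Pairwise L1 distance as the matching cost for every (prediction, gt) pair.
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))
```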
Or, if you just use next-token prediction with a cross-entropy loss, how do you order the ground-truth boxes in the target sequence?
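One option I can imagine is a deterministic raster-scan sort before flattening (Pix2Seq, for comparison, reports that a random per-image ordering also works well). A sketch of such a sort, again only my assumption:

```python
def sort_boxes_raster(boxes, labels):
    """Sort boxes in raster order: top-to-bottom by y1, then left-to-right by x1."""
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
    return [boxes[i] for i in order], [labels[i] for i in order]
```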