
How much does CogCoM increase training cost and inference time?

Open • FaltingsA opened this issue 1 year ago • 1 comment

Thanks for the great work! I am concerned about the computational cost. How much does CogCoM increase training cost and inference time?

FaltingsA · Mar 16 '24 19:03


Hi, thanks for your interest! Compared with VLMs trained on single-image input, each CoM chain may consist of multiple turns of image-text pairs, so training and inference time can grow linearly with the number of turns. We have capped the number of turns at 3 in the data processor. In practice, many CoM chains reach the answer after a single CropZoomIn manipulation on the original image, with the resulting image re-input to the model.

qijimrc · Mar 20 '24 17:03
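Taking the reply's rough model at face value (cost roughly linear in the number of image-text turns, at most 3 turns per chain), the overhead can be bounded with simple arithmetic: a chain that answers after one CropZoomIn has 2 turns (the original image plus the zoomed re-input), so about 2x a single-image sample, and the worst case is about 3x. The Python sketch below makes that concrete. `ComChain`, `cap_turns`, `relative_cost`, and `MAX_TURNS` are hypothetical names, not part of the CogCoM codebase, and the linear-cost model is only the estimate given in the thread.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical cap on image-text turns per CoM chain, matching the "<= 3"
# limit described in the reply. The name MAX_TURNS is an assumption.
MAX_TURNS = 3


@dataclass
class ComChain:
    """Hypothetical CoM chain: each entry is one image-text turn."""
    turns: List[str] = field(default_factory=list)


def cap_turns(chain: ComChain, max_turns: int = MAX_TURNS) -> ComChain:
    """Keep at most `max_turns` turns (illustrative only; a real data
    processor may instead drop or rebuild over-long chains)."""
    return ComChain(turns=chain.turns[:max_turns])


def relative_cost(chain: ComChain) -> int:
    """Cost relative to a single-image sample, assuming cost is linear in
    the number of turns (the maintainer's rough estimate)."""
    return max(1, len(chain.turns))


# Example chains: single image, one CropZoomIn re-input, and an over-long chain.
chains = [
    ComChain(["original image"]),
    ComChain(["original image", "CropZoomIn result"]),
    ComChain(["original image", "zoom 1", "zoom 2", "zoom 3"]),
]
costs = [relative_cost(cap_turns(c)) for c in chains]
print(costs)                     # [1, 2, 3]
print(sum(costs) / len(costs))   # 2.0
```

Under these assumptions the cap guarantees no sample costs more than about 3x a single-image sample, and a dataset dominated by single-CropZoomIn chains would sit closer to 2x. Actual numbers depend on how the model batches and encodes the extra images, which the thread does not specify.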