
InternVL-det?

Open fyting opened this issue 1 year ago • 2 comments

Would it be possible to enhance the detection capability of InternVL by incorporating additional grounding-instruction data during the fine-tuning stage?
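For concreteness, here is a minimal sketch of what one grounding-instruction SFT sample might look like, assuming a `<ref>`/`<box>` tagging convention with box coordinates normalized to a 0-1000 range, as used by several open VLMs. The image path, question text, and exact tags are illustrative assumptions; the format actually expected by InternVL's fine-tuning scripts should be checked against the official documentation.

```python
# Hypothetical grounding-instruction sample (format is an assumption, not
# the confirmed InternVL schema).
grounding_sample = {
    "image": "coco/train2017/000000123456.jpg",  # hypothetical image path
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nPlease detect the dog in the image.",
        },
        {
            "from": "gpt",
            # <ref> names the object, <box> gives [[x1, y1, x2, y2]]
            # normalized to 0-1000 (assumed convention).
            "value": "<ref>dog</ref><box>[[120, 340, 560, 890]]</box>",
        },
    ],
}
```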

fyting commented on Apr 30, 2024

Interesting idea! We haven't tried it yet. You can test this idea by combining InternViT-1.5 and ViTDet (https://github.com/ViTAE-Transformer/ViTDet) or ViT-Adapter (https://github.com/czczup/ViT-Adapter).
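Before a full integration with either repository, a rough way to prototype this is to expose the frozen ViT's patch tokens as a 2D feature map that a ViTDet/ViT-Adapter style detection head can consume. The sketch below is only illustrative: the `ViTDetStyleBackbone` wrapper, the assumed `(B, N, C)` output of the encoder, the patch size of 14, and the 256-channel neck are all assumptions, not code from either repository.

```python
import torch
import torch.nn as nn


class ViTDetStyleBackbone(nn.Module):
    """Minimal sketch: reshape a frozen ViT's patch tokens into a 2D feature
    map for a detection head. `vit` is assumed to return patch embeddings of
    shape (B, N, C), without a CLS token, for input images (B, 3, H, W)."""

    def __init__(self, vit: nn.Module, embed_dim: int, patch_size: int = 14):
        super().__init__()
        self.vit = vit
        self.patch_size = patch_size
        # Freeze the visual encoder so only the detection side is trained.
        for p in self.vit.parameters():
            p.requires_grad = False
        # Simple 1x1 neck projecting ViT features to the head's channel count.
        self.neck = nn.Conv2d(embed_dim, 256, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        tokens = self.vit(images)                              # (B, N, C)
        gh, gw = h // self.patch_size, w // self.patch_size    # patch grid
        fmap = tokens.transpose(1, 2).reshape(b, -1, gh, gw)   # (B, C, gh, gw)
        return self.neck(fmap)                                 # (B, 256, gh, gw)
```

The output map could then be fed to a standard detection head (e.g. the ones provided by ViTDet or ViT-Adapter), keeping the InternViT weights frozen while only the head is trained.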

whai362 commented on May 02, 2024

Every time we adjust the visual encoder, we have to go through a long pre-training stage (LLM frozen, MLP unfrozen, around 40M samples from the v1.2 data) plus SFT (LLM unfrozen, MLP unfrozen, around 1.2M samples from the v1.2 data). Could we instead train directly with the SFT setup (LLM unfrozen, MLP unfrozen, around 40M samples from the v1.2 data) to speed up iterating on the visual encoder? @whai362 @czczup
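For reference, the freezing scheme described above can be expressed in a few lines of PyTorch. This is a minimal sketch assuming the wrapped model exposes `language_model` and `mlp1` (the projector) attributes; the actual attribute names in the InternVL codebase may differ.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradients for all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = flag


# Stage 1: MLP pre-training / alignment (LLM frozen, MLP unfrozen).
def configure_pretraining(model: nn.Module) -> None:
    set_trainable(model.language_model, False)  # assumed attribute name
    set_trainable(model.mlp1, True)             # assumed projector name


# Stage 2: SFT (LLM and MLP both unfrozen).
def configure_sft(model: nn.Module) -> None:
    set_trainable(model.language_model, True)
    set_trainable(model.mlp1, True)
```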

fyting commented on May 09, 2024

Hi, it is still necessary to pretrain and align the MLP first. A more efficient training solution will be left for future work.

G-z-w commented on Jul 24, 2024