
InternVL-det?

Open fyting opened this issue 1 year ago • 2 comments

Would it be possible to enhance the detection capability of InternVL by incorporating additional grounding-instruction data during the fine-tuning stage?
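For concreteness, here is a minimal sketch of what one grounding-instruction SFT sample might look like, assuming a `<ref>`/`<box>` tagging convention with box coordinates normalized to a 0-1000 range, as used by several open VLMs. The image path, question text, and exact tags are illustrative assumptions; the format actually expected by InternVL's fine-tuning scripts should be checked against the official documentation.

```python
# Hypothetical grounding-instruction sample (format is an assumption, not
# the confirmed InternVL schema).
grounding_sample = {
    "image": "coco/train2017/000000123456.jpg",  # hypothetical image path
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nPlease detect the dog in the image.",
        },
        {
            "from": "gpt",
            # <ref> names the object, <box> gives [[x1, y1, x2, y2]]
            # normalized to 0-1000 (assumed convention).
            "value": "<ref>dog</ref><box>[[120, 340, 560, 890]]</box>",
        },
    ],
}
```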

fyting commented on Apr 30, 2024

Interesting idea! We haven't tried it yet. You can test this idea by combining InternViT-1.5 and ViTDet (https://github.com/ViTAE-Transformer/ViTDet) or ViT-Adapter (https://github.com/czczup/ViT-Adapter).
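Before a full integration with either repository, a rough way to prototype this is to expose the frozen ViT's patch tokens as a 2D feature map that a ViTDet/ViT-Adapter style detection head can consume. The sketch below is only illustrative: the `ViTDetStyleBackbone` wrapper, the assumed `(B, N, C)` output of the encoder, the patch size of 14, and the 256-channel neck are all assumptions, not code from either repository.

```python
import torch
import torch.nn as nn


class ViTDetStyleBackbone(nn.Module):
    """Minimal sketch: reshape a frozen ViT's patch tokens into a 2D feature
    map for a detection head. `vit` is assumed to return patch embeddings of
    shape (B, N, C), without a CLS token, for input images (B, 3, H, W)."""

    def __init__(self, vit: nn.Module, embed_dim: int, patch_size: int = 14):
        super().__init__()
        self.vit = vit
        self.patch_size = patch_size
        # Freeze the visual encoder so only the detection side is trained.
        for p in self.vit.parameters():
            p.requires_grad = False
        # Simple 1x1 neck projecting ViT features to the head's channel count.
        self.neck = nn.Conv2d(embed_dim, 256, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        tokens = self.vit(images)                              # (B, N, C)
        gh, gw = h // self.patch_size, w // self.patch_size    # patch grid
        fmap = tokens.transpose(1, 2).reshape(b, -1, gh, gw)   # (B, C, gh, gw)
        return self.neck(fmap)                                 # (B, 256, gh, gw)
```

The output map could then be fed to a standard detection head (e.g. the ones provided by ViTDet or ViT-Adapter), keeping the InternViT weights frozen while only the head is trained.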

whai362 commented on May 02, 2024

Every time we adjust the visual encoder, we have to go through a long pre-training stage (LLM frozen, MLP unfrozen, around 40M samples from the v1.2 data) plus SFT (LLM unfrozen, MLP unfrozen, around 1.2M samples from the v1.2 data). Could we instead train directly with the SFT setup (LLM unfrozen, MLP unfrozen, around 40M samples from the v1.2 data) to speed up iterating on the visual encoder? @whai362 @czczup
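For reference, the freezing scheme described above can be expressed in a few lines of PyTorch. This is a minimal sketch assuming the wrapped model exposes `language_model` and `mlp1` (the projector) attributes; the actual attribute names in the InternVL codebase may differ.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradients for all parameters of a submodule."""
    for p in module.parameters():
        p.requires_grad = flag


# Stage 1: MLP pre-training / alignment (LLM frozen, MLP unfrozen).
def configure_pretraining(model: nn.Module) -> None:
    set_trainable(model.language_model, False)  # assumed attribute name
    set_trainable(model.mlp1, True)             # assumed projector name


# Stage 2: SFT (LLM and MLP both unfrozen).
def configure_sft(model: nn.Module) -> None:
    set_trainable(model.language_model, True)
    set_trainable(model.mlp1, True)
```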

fyting commented on May 09, 2024

Hi, it is still necessary to pretrain and align the MLP first. A more efficient training solution will be left for future work.

G-z-w commented on Jul 24, 2024