LLaVA
[Feature request] Why torch.no_grad on CLIPVisionTower forward?
feature
In llava/model/multimodal_encoder/clip_encoder.py, line 39, the forward pass of the vision encoder has a torch.no_grad decorator. I am trying to do some input optimization, and I think this is stopping gradients from being backpropagated to the input image. Is there a reason for this no_grad? Would it be OK to remove it? (I am happy to make a PR if so :) )
Thanks in advance for any help with this!
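For context, here is a minimal sketch (a toy module, not the actual CLIPVisionTower) of why a torch.no_grad-decorated forward blocks input optimization: the output is detached from the autograd graph, so nothing can be backpropagated to the image.

```python
import torch
import torch.nn as nn

class ToyVisionTower(nn.Module):
    """Stand-in for the vision encoder; not the real LLaVA class."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 8)

    @torch.no_grad()  # analogous to the decorator on clip_encoder.py's forward
    def forward(self, images):
        return self.proj(images)

tower = ToyVisionTower()
tower.requires_grad_(False)  # weights are frozen either way

images = torch.randn(2, 16, requires_grad=True)  # the input we want to optimize
features = tower(images)
print(features.requires_grad)  # False -> no gradient path back to `images`
```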
Hi @LukeBailey181
Thanks for the feedback and for your interest in our project. You are right that this torch.no_grad is preventing you from doing input optimization. This may be one of the overly cautious decorators I have used, to make sure that the vision encoder is not modified during pretraining/instruction tuning. Since we have vision_encoder.requires_grad_(False), this should be fine.
It would be great if you could help create a PR for this. We want to make sure that (1) the vision encoder is not modified in any way that we do not want; and (2) gradients do not backpropagate through the vision encoder unnecessarily (for most use cases, including standard pretraining and instruction tuning), unless we need that, as in your input optimization.
Thank you!
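A minimal sketch of the behavior described above, again using a hypothetical toy module rather than the real encoder: with the decorator removed and only requires_grad_(False) in place, gradients reach the input image while the frozen encoder weights receive none, and the standard training path can still skip graph construction at the call site.

```python
import torch
import torch.nn as nn

class ToyVisionTower(nn.Module):
    """Stand-in for the vision encoder; not the real LLaVA class."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 8)

    def forward(self, images):  # no @torch.no_grad() here
        return self.proj(images)

tower = ToyVisionTower()
tower.requires_grad_(False)  # mirrors vision_encoder.requires_grad_(False)

images = torch.randn(2, 16, requires_grad=True)
loss = tower(images).sum()
loss.backward()

print(images.grad is not None)         # True  -> input optimization works
print(tower.proj.weight.grad is None)  # True  -> encoder weights stay untouched

# For standard pretraining/instruction tuning, the call site can still avoid
# building a graph through the frozen encoder:
with torch.no_grad():
    features = tower(torch.randn(2, 16))
```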
How do I optimize the vision encoder? Which code should I modify?