LLaVA
[Feature request] Why torch.no_grad on CLIPVisionTower forward?
feature
In llava/model/multimodal_encoder/clip_encoder.py, line 39, the forward pass of the vision encoder has a torch.no_grad decorator. I am trying to do some input optimization, and I think this is stopping gradients from being backpropagated to the input image. Is there a reason for this no_grad? Would it be OK to remove it? (I am happy to make a PR if so :) )
Thanks in advance for any help with this!
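For context, here is a minimal sketch (a toy module, not the actual CLIPVisionTower) of why a torch.no_grad-decorated forward blocks input optimization: the output is detached from the autograd graph, so nothing can be backpropagated to the image.

```python
import torch
import torch.nn as nn

class ToyVisionTower(nn.Module):
    """Stand-in for the vision encoder; not the real LLaVA class."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 8)

    @torch.no_grad()  # analogous to the decorator on clip_encoder.py's forward
    def forward(self, images):
        return self.proj(images)

tower = ToyVisionTower()
tower.requires_grad_(False)  # weights are frozen either way

images = torch.randn(2, 16, requires_grad=True)  # the input we want to optimize
features = tower(images)
print(features.requires_grad)  # False -> no gradient path back to `images`
```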
Hi @LukeBailey181
Thanks for the feedback and for your interest in our project. You are right that this torch.no_grad is preventing you from doing input optimization. This may be one of the overly cautious decorators I have used, to make sure that the vision encoder is not modified during pretraining/instruction tuning. Since we have vision_encoder.requires_grad_(False), this should be fine.
It would be great if you could help create a PR for this. We want to make sure that (1) the vision encoder is not modified in any way that we do not want; and (2) gradients do not backpropagate through the vision encoder unnecessarily (for most use cases, including standard pretraining and instruction tuning), unless we need that, as in your input optimization.
Thank you!
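A minimal sketch of the behavior described above, again using a hypothetical toy module rather than the real encoder: with the decorator removed and only requires_grad_(False) in place, gradients reach the input image while the frozen encoder weights receive none, and the standard training path can still skip graph construction at the call site.

```python
import torch
import torch.nn as nn

class ToyVisionTower(nn.Module):
    """Stand-in for the vision encoder; not the real LLaVA class."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 8)

    def forward(self, images):  # no @torch.no_grad() here
        return self.proj(images)

tower = ToyVisionTower()
tower.requires_grad_(False)  # mirrors vision_encoder.requires_grad_(False)

images = torch.randn(2, 16, requires_grad=True)
loss = tower(images).sum()
loss.backward()

print(images.grad is not None)         # True  -> input optimization works
print(tower.proj.weight.grad is None)  # True  -> encoder weights stay untouched

# For standard pretraining/instruction tuning, the call site can still avoid
# building a graph through the frozen encoder:
with torch.no_grad():
    features = tower(torch.randn(2, 16))
```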
How do I optimize the vision encoder? Which code should I modify?