Questions on the Gradients of LLM
As I understand it, one of the core contributions claimed in the paper is that training never requires the derivatives of the LLM, which saves substantial resources.
But how is this enforced in the code?
In `LMAdaptorModel`, the LLM's parameters are frozen:

```python
for param in self.generator.model.parameters():
    param.requires_grad = False
```
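To see the freezing pattern in isolation, here is a minimal sketch; the `nn.Linear` module is a hypothetical stand-in for the wrapped LLM, not the actual `LMAdaptorModel`:

```python
import torch.nn as nn

# Hypothetical stand-in for the wrapped LLM.
model = nn.Linear(4, 4)

# Freeze every parameter, as LMAdaptorModel does for its generator.
for param in model.parameters():
    param.requires_grad = False

# None of these parameters will accumulate .grad during backward.
assert all(not p.requires_grad for p in model.parameters())
```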
In `PromptedClassificationReward`, there is a `no_grad` decorator:

```python
@torch.no_grad()
def _get_logits(
    self,
    texts: List[str]
) -> torch.Tensor:
```
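In isolation, `torch.no_grad()` stops autograd from recording a graph inside the decorated call, so the returned tensor is detached from everything that produced it. A minimal sketch, with a toy layer in place of the real reward model:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)  # toy stand-in for the classifier

@torch.no_grad()
def get_logits(x: torch.Tensor) -> torch.Tensor:
    # No autograd graph is recorded here, even though the
    # layer's weights have requires_grad=True.
    return layer(x)

out = get_logits(torch.randn(3, 4))
print(out.requires_grad)  # False: the output carries no graph
```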
But my experiments show that neither method actually forbids the computation of gradients.
Denote a network block restricted by `no_grad` or `requires_grad = False` as a function $g$, and suppose a trainable block $f$ is attached before it, so the whole network computes $$g(f(x)).$$
Since $f$ needs to be updated, it requires gradients, and my experiments show that in this case gradients are still computed through $g$: there is no other way to obtain the gradients of $f$. So `no_grad`/`requires_grad = False` have no effect; the gradients are still computed. A minimal version of the experiment is sketched below.
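Here, toy `nn.Linear` blocks stand in for $f$ and $g$ (illustrative stand-ins, not the repository's actual modules):

```python
import torch
import torch.nn as nn

f = nn.Linear(4, 4)          # trainable upstream block
g = nn.Linear(4, 4)          # "frozen" downstream block
for param in g.parameters():
    param.requires_grad = False

x = torch.randn(2, 4)
h = f(x)
h.retain_grad()              # keep the gradient arriving at g's input
loss = g(h).sum()
loss.backward()

print(f.weight.grad is not None)  # True: f receives gradients
print(g.weight.grad is None)      # True: g's parameters get no .grad
print(h.grad is not None)         # True: backward still ran through g
```

So `requires_grad = False` only skips the gradients of $g$'s parameters; the backward pass still has to traverse $g$ to reach $f$.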
I wonder, then, how exactly the authors arrange for the gradient computation of the LLM to never happen. Training runs far too fast for it to be happening.