pytorch_optimizer
SophiaH in https://github.com/booydar/LM-RMT
#params = 151111638
#non emb params = 41066400
| epoch 1 step 50 | 50 batches | lr 0.06 | ms/batch 1378.43 | loss 7.85 | ppl 2570.784
| epoch 1 step 100 | 100 batches | lr 0.06 | ms/batch 968.61 | loss 7.49 | ppl 1787.593
| epoch 1 step 150 | 150 batches | lr 0.06 | ms/batch 971.58 | loss 7.48 | ppl 1769.387
| epoch 1 step 200 | 200 batches | lr 0.06 | ms/batch 969.84 | loss 7.47 | ppl 1760.055
| epoch 1 step 250 | 250 batches | lr 0.06 | ms/batch 973.37 | loss 7.46 | ppl 1738.300
| epoch 1 step 300 | 300 batches | lr 0.06 | ms/batch 970.12 | loss 7.48 | ppl 1772.002
| epoch 1 step 350 | 350 batches | lr 0.06 | ms/batch 970.52 | loss 7.47 | ppl 1751.793
| epoch 1 step 400 | 400 batches | lr 0.06 | ms/batch 973.12 | loss 7.47 | ppl 1755.161
| epoch 1 step 450 | 450 batches | lr 0.06 | ms/batch 970.79 | loss 7.46 | ppl 1736.315
| epoch 1 step 500 | 500 batches | lr 0.06 | ms/batch 974.13 | loss 7.48 | ppl 1765.010
| epoch 1 step 550 | 550 batches | lr 0.06 | ms/batch 973.86 | loss 7.48 | ppl 1778.569
Traceback (most recent call last):
File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 620, in
Hmm, I guess it's not an optimizer problem, but maybe a PyTorch autograd internal issue or a problem in the training code (e.g. model, loss, etc.).
I just found that a similar error occurs when the loss function is the CPU version of the loss.
Maybe some modules are not on the same device, or there are unreachable subgraphs (which can't be backpropagated through).
It's strange that it only triggers after so many steps; that seems like it would be a PyTorch/sync issue.
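If it is a device mismatch, a quick sanity check like this would catch it early (a sketch; model stands in for the actual network):
# Sketch: confirm every parameter and buffer lives on a single device.
devices = {p.device for p in model.parameters()}
devices |= {b.device for b in model.buffers()}
assert len(devices) == 1, f"model is spread across devices: {devices}"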
Just wanted to say: if you are using cross-entropy loss (for LM), the SophiaG variant is more efficient, since it just squares the gradient (see https://github.com/Liuhong99/Sophia/blob/19f45d30723bbffcce3d18e4e858d95b0f36dbb6/sophia.py#L56). You can use it like so (not tested):
hessian = list(map(lambda p: p.grad * p.grad, model.parameters()))
opt.step(hessian=hessian)
This also skips the second-order gradient calculation, so it could resolve your issue.
EDIT: you also need to filter out the non-trainable & sparse parameters, so it would be more like:
hessian = [p.grad*p.grad for p in model.parameters() if p.requires_grad and p.grad is not None and not p.grad.is_sparse]
opt.step(hessian=hessian)
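For context, here is an untested sketch of that pattern inside a full training loop; model, loader, and criterion are placeholders, and the step(hessian=...) call follows the suggestion above:
for inputs, targets in loader:
    opt.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # plain backward; no create_graph needed for this estimate
    # squared gradient as a cheap diagonal Hessian estimate
    hessian = [p.grad * p.grad for p in model.parameters()
               if p.requires_grad and p.grad is not None and not p.grad.is_sparse]
    opt.step(hessian=hessian)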
SophiaG worked, but the performance is not better than Adam's, maybe because of the bias. So I want to try SophiaH, which doesn't have that bias.
Some last things to check:
- In the backward call you have create_graph=True (see the sketch below)
- No batch accumulation (which makes create_graph very expensive)
If this is all correct, then it pretty much has to be a bug in PyTorch (or the training code).
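For reference, a plain-PyTorch loop that satisfies both points might look like this (a minimal sketch; compute_loss is a placeholder for your forward/loss code):
for batch in loader:
    opt.zero_grad(set_to_none=True)    # no gradient accumulation across batches
    loss = compute_loss(model, batch)  # placeholder
    loss.backward(create_graph=True)   # keep the graph so SophiaH can take Hessian-vector products
    opt.step()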
I have been running into a similar error message. I've been trying to use SophiaH with Lightning AI's automatic_optimization feature, but it always fails:
Traceback (most recent call last):
File "/src/trainer.py", line 403, in <module>
ai.train(
File "/usr/local/lib/python3.10/dist-packages/aitextgen/aitextgen.py", line 804, in train
trainer.fit(train_model)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 219, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 188, in run
self._optimizer_step(kwargs.get("batch_idx", 0), closure)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 266, in _optimizer_step
call._call_lightning_module_hook(
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 146, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1276, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 161, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 231, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 116, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
self.compute_hutchinson_hessian(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 303, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
If I iterate through every parameter and set requires_grad to True, then I go OOM immediately at the "update_period" step:
for n, p in self.model.named_parameters():
    p.requires_grad = True
If I set requires_grad to False, then training will progress - but the model never learns anything.
If requires_grad is unset for ANY parameter, I get the original error message.
I am unsure how to proceed at this point, but I would greatly appreciate any advice you have to offer.
Hello!
The SophiaH optimizer needs create_graph=True when calling backward(), which means that automatic_optimization should be set to False!
Here's an example.
import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning.pytorch as pl
from pytorch_optimizer import SophiaH

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.automatic_optimization = False  # SophiaH requires manual optimization

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()

        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # important: keep the graph so SophiaH can compute Hessian-vector products
        self.manual_backward(loss, create_graph=True)
        opt.step()

        self.log("train_loss", loss)

    def configure_optimizers(self):
        return SophiaH(self.parameters())

dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

autoencoder = LitAutoEncoder(encoder, decoder)
trainer = pl.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
Thank you for the quick response! I have applied your example to my own code (to the best of my ability), and while we're making progress, training bombs with a new error after reaching the first update_period:
Traceback (most recent call last):
File "/src/trainer.py", line 403, in <module>
ai.train(
File "/usr/local/lib/python3.10/dist-packages/aitextgen/aitextgen.py", line 804, in train
trainer.fit(train_model)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 221, in advance
batch_output = self.manual_optimization.run(kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/manual.py", line 91, in run
self.advance(kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/manual.py", line 111, in advance
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 294, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 380, in training_step
return self.model.training_step(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/aitextgen/train.py", line 59, in training_step
opt.step()
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 161, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 231, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 116, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
self.compute_hutchinson_hessian(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 303, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 4 of tensors does not require grad and does not have a grad_fn
I don't suspect this is the cause, but there is a warning at the beginning of training:
/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:200: UserWarning: Using backward() with create_graph=True will create a reference cycle between the parameter and its gradient which can cause a memory leak. We recommend using autograd.grad when creating the graph to avoid this. If you have to use this function, make sure to reset the .grad fields of your parameters to None after use to break the cycle and avoid the leak. (Triggered internally at ../torch/csrc/autograd/engine.cpp:1151.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
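In other words, per that warning, the grads should be reset to None after each step (sketch):
opt.step()
opt.zero_grad(set_to_none=True)  # break the create_graph reference cycle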
It may be relevant to know that I am using the Huggingface PEFT library for LoRA training. I don't suspect that is the issue either, since all that really does is add some extra layers to the model, and freeze all the other layers.
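That said, since LoRA freezes most parameters, and the traceback shows compute_hutchinson_hessian calling torch.autograd.grad over the optimizer's parameters, any frozen tensor in that list would raise exactly this "element N of tensors does not require grad" error. A quick diagnostic sketch (model stands in for the PEFT-wrapped model; the workaround is untested):
from pytorch_optimizer import SophiaH

# count the frozen parameters that could trip the Hutchinson step
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print(f"{len(frozen)} frozen parameters, e.g. {frozen[:3]}")

# possible workaround: build the optimizer over trainable parameters only
optimizer = SophiaH([p for p in model.parameters() if p.requires_grad])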
I will troubleshoot some more when I get the chance. It's been a long day already, and I need to take a break. Thank you for the help thus far, and for maintaining such a useful library!
Alright, well I was able to test your example MNIST code, and it does work. So I know this isn't an environment issue.
I removed PEFT as well, and tried standard fine-tuning. I also tried a couple of different models (GPT-2 and GPT-Neo) from the Hugging Face Transformers library. All ran into the same problem with "tensors does not require grad and does not have a grad_fn".
I'm sure the issue has to do with my training code. I'm carrying some legacy baggage, and I don't really have the proper skill set to know how to optimize manually (which is why I've relied on automatic_optimization until now). I haven't given up, but I'm probably going to move on for now. I appreciate your help.