Adafactor fails to run on a custom (rfs) resnet12 (with MAML)

Open · brando90 opened this issue 3 years ago · 3 comments

I was trying Adafactor, but I get the following error:

args.scheduler=None
--------------------- META-TRAIN ------------------------
Starting training!
Traceback (most recent call last):
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 441, in <module>
    main_resume_from_checkpoint(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 403, in main_resume_from_checkpoint
    run_training(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/experiments/meta_learning/main_metalearning.py", line 413, in run_training
    meta_train_fixed_iterations(args)
  File "/home/miranda9/automl-meta-learning/automl-proj-src/meta_learning/training/meta_training.py", line 233, in meta_train_fixed_iterations
    args.outer_opt.step()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 191, in step
    self._approx_sq_grad(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch_optimizer/adafactor.py", line 116, in _approx_sq_grad
    (exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1))
RuntimeError: The size of tensor a (3) must match the size of tensor b (64) at non-singleton dimension 1

Training runs fine with PyTorch's default Adam, so why does this one fail?
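
For context, here is a minimal sketch of what I believe triggers the error (illustrative, not my actual MAML setup; it assumes torch-optimizer's Adafactor with its default arguments). A single 4-D conv weight seems to be enough to hit the broadcasting failure in the factored second-moment update:

```python
import torch
import torch_optimizer as optim

conv = torch.nn.Conv2d(3, 64, kernel_size=3)  # weight shape [64, 3, 3, 3]
opt = optim.Adafactor(conv.parameters())      # defaults: lr=None, relative_step=True

loss = conv(torch.randn(1, 3, 8, 8)).sum()
loss.backward()
opt.step()  # RuntimeError: The size of tensor a (3) must match the size of
            # tensor b (64) at non-singleton dimension 1
```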

related:

  • https://github.com/jettify/pytorch-optimizer/issues/404
  • https://stackoverflow.com/questions/70218565/how-to-have-adafactor-run-a-custom-rfs-resnet12-with-maml-for-torch-optimize?noredirect=1&lq=1

brando90 · Dec 03 '21 17:12

Are there any updates on this? The issue is still present.

ionutmodo · Jul 10 '23 09:07

I had a look at this error, which I also faced when training a ResNet-50 model. I got a similar error to @brando90's, except that my tensor dimensions were different. Here is how I managed to fix it.

First of all, the exception is raised in `_approx_sq_grad` (torch_optimizer/adafactor.py, line 116 in the traceback above), where the tensor `exp_avg_sq_row` is divided by its mean over the last dimension. In my case, `exp_avg_sq_row` has size `[64, 3, 7]`, so `exp_avg_sq_row.mean(dim=-1)` has size `[64, 3]`, and the size mismatch in this division raises the `RuntimeError`.

The solution is to unsqueeze the mean tensor: instead of `(exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1))`, we should do `(exp_avg_sq_row / exp_avg_sq_row.mean(dim=-1).unsqueeze(-1))`.
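
To see the broadcasting issue in isolation, here is a small sketch using the `[64, 3, 7]` shape from my case (plain tensor ops only, nothing optimizer-specific):

```python
import torch

exp_avg_sq_row = torch.rand(64, 3, 7)          # the [64, 3, 7] case described above
row_mean = exp_avg_sq_row.mean(dim=-1)         # shape [64, 3]

# exp_avg_sq_row / row_mean                    # RuntimeError: [64, 3, 7] and [64, 3]
#                                              # do not broadcast
out = exp_avg_sq_row / row_mean.unsqueeze(-1)  # [64, 3, 7] / [64, 3, 1] broadcasts fine
```

Equivalently, `exp_avg_sq_row.mean(dim=-1, keepdim=True)` keeps the reduced dimension in place, so no separate `unsqueeze` is needed.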

ionutmodo · Jul 10 '23 13:07

This still happens. Could someone make a pull request?

Xynonners · Jul 28 '23 05:07