LightningModule optimizer issue - Thermostability fine-tuning
Hi all, I have been trying to run the example to fine-tune the 650M model with the provided thermostability data. Unfortunately, I'm getting the following error:
[rank 0]: TypeError: LightningModule.optimizer_step() takes from 4 to 5 positional arguments but 9 were given
I'm using exactly the script provided, only changing the number of GPUs from 4 to 1 and CUDA_VISIBLE_DEVICES to 0.
Any help is greatly appreciated. Thank you
Hi,
I think it's due to an incompatibility with your version of pytorch-lightning. Could you downgrade pytorch-lightning to 1.8.3?
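In case it helps, with a pip-managed environment that would just be `pip install pytorch-lightning==1.8.3`. The `optimizer_step()` hook signature changed between pytorch-lightning 1.x and 2.x, which is most likely why code written against 1.8.x fails with "takes from 4 to 5 positional arguments but 9 were given" on a newer install.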
Thanks, that solved the previous issue, and training started just fine.
Also, is there any way to automatically select the option to not visualize the results (option 3) at the interactive prompt? I'm submitting my job through Slurm, and I assume the error I'm getting is because of that:
wandb.errors.errors.UsageError: api_key not configured (no-tty)
I also tried setting WANDB_MODE: dryrun in the config file, as well as wandb disabled, but it did not work.
Moreover, by setting logger: False I got the error No supported gpu backend found!
Thanks
If you don't want to record your training, then setting logger to False should work. The error No supported gpu backend found! seems to be caused by your hardware configuration. How did you run it normally, as you said "that solved the previous issue, and training started just fine"?
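As a side note, if it's the wandb prompt that blocks the batch job, a generic workaround (not specific to this repo, so the exact place to put it may differ) is to force wandb into non-interactive mode through environment variables before anything imports it:

```python
# Generic sketch: set these before wandb is imported so it never prompts
# for an API key or a menu choice when there is no TTY (e.g. under sbatch).
import os

os.environ["WANDB_MODE"] = "offline"   # log locally; use "disabled" to skip wandb entirely
os.environ["WANDB_SILENT"] = "true"    # suppress wandb's console prompts/output

# ...then import and run the training code (e.g. scripts/training.py) as usual.
```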
To run it normally, I requested the necessary resources through srun in the terminal, then ran scripts/training.py, interactively chose option 3, and everything started just fine.
The issue arises when I submit the training through sbatch. Currently it is not giving any error, but it is stuck at
All distributed processes registered. Starting with 1 processes
I decided to create an account on wandb, but the problem with sbatch persists, and it is still stuck.
The problem is more likely due to the sbatch command, not wandb. I'm not familiar with Slurm. Perhaps you could check whether sbatch does some additional operations that conflict with the Python script?
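One quick way to narrow it down (a generic diagnostic, not part of this repo) is to run a tiny check inside the sbatch allocation, to confirm the job actually sees a GPU and that nothing is waiting on an interactive prompt:

```python
# Generic diagnostic to run inside the sbatch job: checks GPU visibility for
# PyTorch and whether stdin is a TTY (interactive prompts hang without one).
import sys
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
print("stdin is a TTY:", sys.stdin.isatty())
```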