Issue training the time-aware encoder
I run the command python main.py --train --base configs/stableSRNew/v2-finetune_text_T_512.yaml --gpus GPU_ID, --name NAME --scale_lr True (with GPU_ID and NAME filled in, and the config file adapted to my data). Below is a snippet of the terminal output.
Monitoring val/loss_simple_ema as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': './logs/2023-10-12T21-07-16_HRLP/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 20, 'every_n_train_steps': 1500}}
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Data
train, RealESRGANDataset, 432
validation, RealESRGANDataset, 100
accumulate_grad_batches = 4
Setting learning rate to 1.20e-03 = 4 (accumulate_grad_batches) * 1 (num_gpus) * 6 (batchsize) * 5.00e-05 (base_lr)
Global seed set to 23
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
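For reference, the learning rate in the log above is produced by the --scale_lr option, which multiplies the base learning rate by the effective batch size. A minimal sketch of that computation, using the values from this log:

```python
# Reproduce the --scale_lr computation reported in the log above.
accumulate_grad_batches = 4
num_gpus = 1
batch_size = 6
base_lr = 5.00e-05

scaled_lr = accumulate_grad_batches * num_gpus * batch_size * base_lr
print(f"Setting learning rate to {scaled_lr:.2e}")  # 1.20e-03
```

So the 1.20e-03 in the log is expected and not itself a sign of misconfiguration.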
After this it connects to Weights & Biases and the process exits, but training never actually starts and none of the expected output files appear.
The problem is most likely insufficient CPU RAM: you need at least 18 GB of CPU RAM to run the code. When the process is killed by the system OOM killer, it can exit silently without a Python traceback, which matches what you describe.