Issue training the time-aware encoder
I run the command python main.py --train --base configs/stableSRNew/v2-finetune_text_T_512.yaml --gpus GPU_ID, --name NAME --scale_lr True (with GPU_ID and NAME filled in, and the config file adapted to my data). Below is a snippet of the terminal output.
Monitoring val/loss_simple_ema as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': './logs/2023-10-12T21-07-16_HRLP/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 20, 'every_n_train_steps': 1500}}
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Data
train, RealESRGANDataset, 432
validation, RealESRGANDataset, 100
accumulate_grad_batches = 4
Setting learning rate to 1.20e-03 = 4 (accumulate_grad_batches) * 1 (num_gpus) * 6 (batchsize) * 5.00e-05 (base_lr)
Global seed set to 23
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
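For reference, the learning rate in the log above is produced by the --scale_lr option, which multiplies the base learning rate by the effective batch size. A minimal sketch of that computation, using the values from this log:

```python
# Reproduce the --scale_lr computation reported in the log above.
accumulate_grad_batches = 4
num_gpus = 1
batch_size = 6
base_lr = 5.00e-05

scaled_lr = accumulate_grad_batches * num_gpus * batch_size * base_lr
print(f"Setting learning rate to {scaled_lr:.2e}")  # 1.20e-03
```

So the 1.20e-03 in the log is expected and not itself a sign of misconfiguration.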
After this it connects to Weights & Biases and the process exits, but training never actually starts and none of the expected output files appear.
The problem is most likely insufficient CPU RAM: you need at least 18 GB of CPU RAM to run the code. When the process is killed by the system OOM killer, it can exit silently without a Python traceback, which matches what you describe.