
Everything prints fine, but the loss doesn't decrease

Open 2catycm opened this issue 1 year ago • 5 comments

Bug description

Even after I set the learning rate to 1 and even 100, the loss doesn't change at all; it always stays at about 4.60. I stepped through training in a debugger, and everything seems to work: the loss is backpropagated successfully, the gradients of each parameter look reasonable, and the optimizer is indeed called.
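A further check that may help narrow this down is to compare parameter values before and after `opt.step()`, since non-zero gradients alone do not prove the weights move. The sketch below is only an illustration on a toy model, not the model from this issue: `snapshot` and `changed_parameter_names` are hypothetical helpers, and the first layer is frozen on purpose to show what a parameter that never updates looks like.

```python
import torch
import torch.nn as nn


def snapshot(module: nn.Module) -> dict[str, torch.Tensor]:
    """Copy every parameter so it can be compared after an optimizer step."""
    return {name: p.detach().clone() for name, p in module.named_parameters()}


def changed_parameter_names(module: nn.Module, before: dict[str, torch.Tensor]) -> list[str]:
    """Return the names of parameters whose values differ from the snapshot."""
    return [
        name
        for name, p in module.named_parameters()
        if not torch.equal(p.detach(), before[name])
    ]


torch.manual_seed(0)
# Toy stand-in for the real network (assumption: any nn.Module works the same way).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

# Freeze the first layer deliberately to simulate parameters that never update.
for p in model[0].parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-2)
x = torch.randn(16, 4)
y = torch.randint(0, 3, (16,))

before = snapshot(model)
opt.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

changed = changed_parameter_names(model, before)
print("parameters that changed:", changed)
```

If the same comparison inside `training_step` reports that nothing changed, the optimizer is probably not attached to the parameters that receive gradients; if everything changed but the loss is still flat, the cause is more likely in how the loss is computed.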

What version are you seeing the problem on?

v2.3

How to reproduce the bug

class ClassificationTask(L.LightningModule):
    def __init__(self, config: ClassificationTaskConfig)->None:
        super().__init__()
        self.save_hyperparameters(config.model_dump())
        L.seed_everything(config.experiment_index) # use index as the seed for reproducibility
        self.lit_data:ClassificationDataModule = config.dataset_config.get_lightning_data_module()
        config.cls_model_config.num_of_classes = self.lit_data.num_of_classes
        self.cls_model:HuggingfaceModel = config.cls_model_config.get_cls_model()
        self.lit_data.set_transform_from_hf_image_preprocessor(hf_image_preprocessor=self.cls_model.image_preprocessor)
        
        model_image_size:tuple[int, int] = (self.cls_model.image_preprocessor.size['height'], self.cls_model.image_preprocessor.size['width'])
        self.example_input_array = torch.zeros(1, self.cls_model.backbone.config.num_channels, *model_image_size)  # torch.Tensor(...) would return uninitialized memory
        
        self.softmax = nn.Softmax(dim=1)    
        self.loss = nn.CrossEntropyLoss(label_smoothing=config.label_smoothing)
        
        self.automatic_optimization = False # The problem also occurs with True; switched to manual optimization to debug the training step
    
    def compute_model_logits(self, image_tensor:torch.Tensor)-> torch.Tensor:
        return self.cls_model(image_tensor)
    
    @override
    def forward(self, image_tensor:torch.Tensor, *args, **kwargs)-> torch.Tensor:
        return self.softmax(self.compute_model_logits(image_tensor))

    def forward_loss(self, image_tensor: torch.Tensor, label_tensor:torch.Tensor)->torch.Tensor:
        probs = self(image_tensor)
        # return F.nll_loss(logits, label_tensor)
        return self.loss(probs, label_tensor)
    
    @override
    def training_step(self, batch, batch_idx=None, *args, **kwargs)-> STEP_OUTPUT:
        self.train()
        opt = self.optimizers()
        opt.zero_grad()
        
        loss = self.forward_loss(*batch)
        self.log("train_loss", loss, prog_bar=True)
        # self.manual_backward(loss)
        loss.backward()
        opt.step()
        return loss

    @override    
    def configure_optimizers(self) -> OptimizerLRScheduler:
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.learning_rate)

# Training script (a separate module from the class definition above)
from .core import ClassificationTask, ClassificationTaskConfig
config = ClassificationTaskConfig()
config.learning_rate = 3e-4 # realistic value; the loss does not decrease
config.learning_rate = 1000 # deliberately huge for debugging: the loss should become NaN if the weights were really being updated
config.dataset_config.batch_size = 64
cls_task = ClassificationTask(config)

import lightning as L
from .utils import runs_path
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks import ModelSummary, StochasticWeightAveraging, DeviceStatsMonitor
from lightning.pytorch.loggers import TensorBoardLogger, CSVLogger
trainer = L.Trainer(
    default_root_dir=runs_path,
    enable_checkpointing=True,
    enable_model_summary=True,
    num_sanity_val_steps=2,
    callbacks=[
        EarlyStopping(
            monitor="val_acc1",
            mode="max",
            check_finite=True,
            patience=5,
            check_on_train_epoch_end=False,  # check at the end of validation
            verbose=True,
        ),
        ModelSummary(max_depth=3),
        DeviceStatsMonitor(cpu_stats=True),
    ],
    logger=[
        TensorBoardLogger(save_dir=runs_path / "tensorboard"),
        CSVLogger(save_dir=runs_path),
    ],
)
trainer.fit(cls_task, datamodule=cls_task.lit_data)
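
One thing worth checking in the code above, though it is not confirmed as the cause in this issue: `forward` applies `nn.Softmax`, and `forward_loss` passes that output to `nn.CrossEntropyLoss`, which expects raw logits and applies log-softmax itself. With 100 classes, a uniform prediction gives a loss of ln(100) ≈ 4.605, and feeding probabilities (all in [0, 1]) instead of logits confines the loss to roughly [3.62, 4.62]. The sketch below illustrates this on random data. The batch size and logit scale are arbitrary assumptions, and label smoothing is left out for simplicity.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 100  # matches the 100-way head shown in the model summary
batch_size = 8     # arbitrary value, for illustration only

# Random stand-ins for model outputs and targets (not data from the issue).
logits = 5.0 * torch.randn(batch_size, num_classes)
labels = torch.randint(0, num_classes, (batch_size,))
loss_fn = nn.CrossEntropyLoss()

# Intended usage: CrossEntropyLoss receives raw logits.
loss_on_logits = loss_fn(logits, labels).item()

# Pattern in the issue: softmax is applied first, then CrossEntropyLoss.
probs = torch.softmax(logits, dim=1)
loss_on_probs = loss_fn(probs, labels).item()

# Analytic range of the loss when the inputs are probabilities on the simplex:
# best case is a one-hot vector on the correct class, worst case on a wrong one.
lower_bound = math.log(num_classes - 1 + math.e) - 1.0
upper_bound = math.log(num_classes - 1 + math.e)
uniform_loss = math.log(num_classes)

print(f"loss on logits:        {loss_on_logits:.3f}")
print(f"loss on probabilities: {loss_on_probs:.3f}")
print(f"possible range on probabilities: [{lower_bound:.3f}, {upper_bound:.3f}]")
print(f"loss of a uniform prediction:    {uniform_loss:.3f}")
```

If this turns out to be the issue, a possible fix is to pass the output of `compute_model_logits` to the loss and keep the softmax only for inference.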

Error messages and logs

root
└── cls_model (HuggingfaceModel)
    ├── backbone (ViTModel)
    │   ├── embeddings (ViTEmbeddings) cls_token:[1, 1, 768] position_embeddings:[1, 197, 768]
    │   │   └── patch_embeddings (ViTPatchEmbeddings)
    │   │       └── projection (Conv2d) weight:[768, 3, 16, 16] bias:[768]
    │   ├── encoder (ViTEncoder)
    │   │   └── layer (ModuleList)
    │   │       └── 0-11(ViTLayer)
    │   │           ├── attention (ViTAttention)
    │   │           │   ├── attention (ViTSelfAttention)
    │   │           │   │   └── query,key,value(Linear) weight:[768, 768] bias:[768]
    │   │           │   └── output (ViTSelfOutput)
    │   │           │       └── dense (Linear) weight:[768, 768] bias:[768]
    │   │           ├── intermediate (ViTIntermediate)
    │   │           │   └── dense (Linear) weight:[3072, 768] bias:[3072]
    │   │           ├── output (ViTOutput)
    │   │           │   └── dense (Linear) weight:[768, 3072] bias:[768]
    │   │           └── layernorm_before,layernorm_after(LayerNorm) weight:[768] bias:[768]
    │   ├── layernorm (LayerNorm) weight:[768] bias:[768]
    │   └── pooler (ViTPooler)
    │       └── dense (Linear) weight:[768, 768] bias:[768]
    └── head (Linear) weight:[100, 768] bias:[100]
Files already downloaded and verified
Files already downloaded and verified
202

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:00<00:00,  1.78it/s]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  2.78it/s]
                                                                           

Training: |          | 0/? [00:00<?, ?it/s]
Training:   0%|          | 0/704 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/704 [00:00<?, ?it/s] 
Epoch 0:   0%|          | 1/704 [00:02<30:06,  0.39it/s]
Epoch 0:   0%|          | 1/704 [00:02<30:07,  0.39it/s, v_num=11, train_loss=4.610]
Epoch 0:   0%|          | 2/704 [00:03<17:54,  0.65it/s, v_num=11, train_loss=4.610]
Epoch 0:   0%|          | 2/704 [00:03<17:55,  0.65it/s, v_num=11, train_loss=4.610]
Epoch 0:   0%|          | 3/704 [00:03<13:59,  0.84it/s, v_num=11, train_loss=4.610]
Epoch 0:   0%|          | 3/704 [00:03<14:01,  0.83it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 4/704 [00:03<11:26,  1.02it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 4/704 [00:04<11:49,  0.99it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|          | 5/704 [00:04<09:31,  1.22it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|          | 5/704 [00:04<10:25,  1.12it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 6/704 [00:04<08:46,  1.33it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 6/704 [00:04<09:30,  1.22it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 7/704 [00:04<08:11,  1.42it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|          | 7/704 [00:05<08:50,  1.31it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|          | 8/704 [00:05<07:52,  1.47it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|          | 8/704 [00:05<08:22,  1.39it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|▏         | 9/704 [00:05<07:35,  1.53it/s, v_num=11, train_loss=4.600]
Epoch 0:   1%|▏         | 9/704 [00:06<07:58,  1.45it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|▏         | 10/704 [00:06<07:18,  1.58it/s, v_num=11, train_loss=4.610]
Epoch 0:   1%|▏         | 10/704 [00:06<07:39,  1.51it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 11/704 [00:06<06:59,  1.65it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 11/704 [00:07<07:23,  1.56it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 12/704 [00:07<06:48,  1.70it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 12/704 [00:07<07:10,  1.61it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 13/704 [00:07<06:39,  1.73it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 13/704 [00:07<06:59,  1.65it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 14/704 [00:07<06:30,  1.77it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 14/704 [00:08<06:49,  1.68it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 15/704 [00:08<06:23,  1.80it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 15/704 [00:08<06:41,  1.72it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 16/704 [00:08<06:16,  1.83it/s, v_num=11, train_loss=4.600]
Epoch 0:   2%|▏         | 16/704 [00:09<06:33,  1.75it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 17/704 [00:09<06:11,  1.85it/s, v_num=11, train_loss=4.610]
Epoch 0:   2%|▏         | 17/704 [00:09<06:27,  1.77it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 18/704 [00:09<06:06,  1.87it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 18/704 [00:10<06:21,  1.80it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 19/704 [00:10<06:02,  1.89it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 19/704 [00:10<06:15,  1.82it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 20/704 [00:10<05:57,  1.91it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 20/704 [00:10<06:10,  1.84it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 21/704 [00:10<05:53,  1.93it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 21/704 [00:11<06:06,  1.86it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 22/704 [00:11<05:50,  1.95it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 22/704 [00:11<06:02,  1.88it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 23/704 [00:11<05:48,  1.95it/s, v_num=11, train_loss=4.610]
Epoch 0:   3%|▎         | 23/704 [00:12<05:58,  1.90it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 24/704 [00:12<05:44,  1.97it/s, v_num=11, train_loss=4.600]
Epoch 0:   3%|▎         | 24/704 [00:12<05:55,  1.91it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▎         | 25/704 [00:12<05:41,  1.99it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▎         | 25/704 [00:12<05:52,  1.93it/s, v_num=11, train_loss=4.610]
Epoch 0:   4%|▎         | 26/704 [00:13<05:39,  2.00it/s, v_num=11, train_loss=4.610]
Epoch 0:   4%|▎         | 26/704 [00:13<05:49,  1.94it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 27/704 [00:13<05:36,  2.01it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 27/704 [00:13<05:46,  1.95it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 28/704 [00:13<05:34,  2.02it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 28/704 [00:14<05:43,  1.97it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 29/704 [00:14<05:32,  2.03it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 29/704 [00:14<05:41,  1.98it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 30/704 [00:14<05:30,  2.04it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 30/704 [00:15<05:39,  1.99it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 31/704 [00:15<05:30,  2.03it/s, v_num=11, train_loss=4.600]
Epoch 0:   4%|▍         | 31/704 [00:15<05:36,  2.00it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 32/704 [00:15<05:27,  2.05it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 32/704 [00:15<05:34,  2.01it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 33/704 [00:15<05:24,  2.07it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 33/704 [00:16<05:32,  2.02it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 34/704 [00:16<05:23,  2.07it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▍         | 34/704 [00:16<05:30,  2.03it/s, v_num=11, train_loss=4.610]
Epoch 0:   5%|▍         | 35/704 [00:16<05:21,  2.08it/s, v_num=11, train_loss=4.610]
Epoch 0:   5%|▍         | 35/704 [00:17<05:29,  2.03it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 36/704 [00:17<05:20,  2.09it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 36/704 [00:17<05:27,  2.04it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 37/704 [00:17<05:18,  2.09it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 37/704 [00:18<05:25,  2.05it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 38/704 [00:18<05:17,  2.10it/s, v_num=11, train_loss=4.600]
Epoch 0:   5%|▌         | 38/704 [00:18<05:24,  2.06it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 39/704 [00:18<05:15,  2.11it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 39/704 [00:18<05:22,  2.06it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 40/704 [00:18<05:15,  2.11it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 40/704 [00:19<05:21,  2.07it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 41/704 [00:19<05:13,  2.12it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 41/704 [00:19<05:19,  2.07it/s, v_num=11, train_loss=4.610]
Epoch 0:   6%|▌         | 42/704 [00:19<05:12,  2.12it/s, v_num=11, train_loss=4.610]
Epoch 0:   6%|▌         | 42/704 [00:20<05:18,  2.08it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 43/704 [00:20<05:10,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▌         | 43/704 [00:20<05:16,  2.09it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▋         | 44/704 [00:20<05:09,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▋         | 44/704 [00:21<05:15,  2.09it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▋         | 45/704 [00:21<05:09,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   6%|▋         | 45/704 [00:21<05:14,  2.10it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 46/704 [00:21<05:07,  2.14it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 46/704 [00:21<05:13,  2.10it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 47/704 [00:21<05:06,  2.14it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 47/704 [00:22<05:12,  2.11it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 48/704 [00:22<05:05,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 48/704 [00:22<05:10,  2.11it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 49/704 [00:22<05:05,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 49/704 [00:23<05:09,  2.11it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 50/704 [00:23<05:04,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 50/704 [00:23<05:08,  2.12it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 51/704 [00:23<05:03,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 51/704 [00:24<05:08,  2.12it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 52/704 [00:24<05:02,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   7%|▋         | 52/704 [00:24<05:07,  2.12it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 53/704 [00:24<05:01,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 53/704 [00:24<05:06,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 54/704 [00:24<05:00,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 54/704 [00:25<05:05,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 55/704 [00:25<04:59,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 55/704 [00:25<05:04,  2.13it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 56/704 [00:25<04:58,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 56/704 [00:26<05:03,  2.14it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 57/704 [00:26<04:57,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 57/704 [00:26<05:02,  2.14it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 58/704 [00:26<04:57,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 58/704 [00:27<05:01,  2.14it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 59/704 [00:27<04:56,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:   8%|▊         | 59/704 [00:27<05:00,  2.15it/s, v_num=11, train_loss=4.610]
Epoch 0:   9%|▊         | 60/704 [00:27<04:55,  2.18it/s, v_num=11, train_loss=4.610]
Epoch 0:   9%|▊         | 60/704 [00:27<04:59,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▊         | 61/704 [00:27<04:54,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▊         | 61/704 [00:28<04:58,  2.15it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 62/704 [00:28<04:53,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 62/704 [00:28<04:57,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 63/704 [00:28<04:53,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 63/704 [00:29<04:56,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 64/704 [00:29<04:52,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 64/704 [00:29<04:56,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 65/704 [00:29<04:51,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 65/704 [00:30<04:55,  2.16it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 66/704 [00:30<04:50,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:   9%|▉         | 66/704 [00:30<04:54,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 67/704 [00:30<04:50,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 67/704 [00:30<04:53,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 68/704 [00:30<04:49,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 68/704 [00:31<04:52,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 69/704 [00:31<04:48,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 69/704 [00:31<04:52,  2.17it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 70/704 [00:31<04:47,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|▉         | 70/704 [00:32<04:51,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 71/704 [00:32<04:47,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 71/704 [00:32<04:50,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 72/704 [00:32<04:46,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 72/704 [00:33<04:49,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 73/704 [00:33<04:45,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  10%|█         | 73/704 [00:33<04:49,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 74/704 [00:33<04:44,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 74/704 [00:33<04:48,  2.18it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 75/704 [00:33<04:44,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 75/704 [00:34<04:47,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 76/704 [00:34<04:44,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 76/704 [00:34<04:46,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 77/704 [00:34<04:43,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 77/704 [00:35<04:46,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 78/704 [00:35<04:42,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 78/704 [00:35<04:45,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 79/704 [00:35<04:42,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█         | 79/704 [00:36<04:44,  2.19it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█▏        | 80/704 [00:36<04:41,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  11%|█▏        | 80/704 [00:36<04:44,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 81/704 [00:36<04:40,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 81/704 [00:36<04:43,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 82/704 [00:36<04:39,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 82/704 [00:37<04:42,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 83/704 [00:37<04:39,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 83/704 [00:37<04:42,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 84/704 [00:37<04:38,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 84/704 [00:38<04:41,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 85/704 [00:38<04:38,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 85/704 [00:38<04:40,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 86/704 [00:38<04:37,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 86/704 [00:39<04:40,  2.20it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 87/704 [00:39<04:36,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▏        | 87/704 [00:39<04:39,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▎        | 88/704 [00:39<04:36,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  12%|█▎        | 88/704 [00:39<04:39,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 89/704 [00:39<04:36,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 89/704 [00:40<04:38,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 90/704 [00:40<04:35,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 90/704 [00:40<04:37,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 91/704 [00:40<04:34,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 91/704 [00:41<04:37,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 92/704 [00:41<04:34,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 92/704 [00:41<04:36,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 93/704 [00:41<04:33,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 93/704 [00:42<04:35,  2.21it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 94/704 [00:42<04:32,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 94/704 [00:42<04:35,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 95/704 [00:42<04:32,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  13%|█▎        | 95/704 [00:42<04:34,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▎        | 96/704 [00:42<04:31,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▎        | 96/704 [00:43<04:34,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 97/704 [00:43<04:31,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 97/704 [00:43<04:33,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 98/704 [00:43<04:30,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 98/704 [00:44<04:32,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 99/704 [00:44<04:30,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 99/704 [00:44<04:32,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 100/704 [00:44<04:29,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 100/704 [00:45<04:32,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 101/704 [00:45<04:29,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 101/704 [00:45<04:31,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 102/704 [00:45<04:28,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  14%|█▍        | 102/704 [00:45<04:30,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 103/704 [00:45<04:28,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 103/704 [00:46<04:30,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 104/704 [00:46<04:27,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 104/704 [00:46<04:29,  2.22it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 105/704 [00:46<04:26,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▍        | 105/704 [00:47<04:29,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 106/704 [00:47<04:26,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 106/704 [00:47<04:28,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 107/704 [00:47<04:25,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 107/704 [00:48<04:28,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 108/704 [00:48<04:25,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 108/704 [00:48<04:27,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 109/704 [00:48<04:24,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  15%|█▌        | 109/704 [00:48<04:26,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 110/704 [00:48<04:24,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 110/704 [00:49<04:26,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 111/704 [00:49<04:23,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 111/704 [00:49<04:25,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 112/704 [00:49<04:23,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 112/704 [00:50<04:25,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 113/704 [00:50<04:22,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 113/704 [00:50<04:24,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 114/704 [00:50<04:22,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▌        | 114/704 [00:51<04:24,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▋        | 115/704 [00:51<04:21,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▋        | 115/704 [00:51<04:23,  2.23it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▋        | 116/704 [00:51<04:21,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  16%|█▋        | 116/704 [00:51<04:23,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 117/704 [00:51<04:20,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 117/704 [00:52<04:22,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 118/704 [00:52<04:20,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 118/704 [00:52<04:21,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 119/704 [00:52<04:19,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 119/704 [00:53<04:21,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 120/704 [00:53<04:19,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 120/704 [00:53<04:20,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 121/704 [00:53<04:18,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 121/704 [00:54<04:20,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 122/704 [00:54<04:18,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 122/704 [00:54<04:19,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 123/704 [00:54<04:17,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  17%|█▋        | 123/704 [00:54<04:19,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 124/704 [00:54<04:17,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 124/704 [00:55<04:18,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 125/704 [00:55<04:16,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 125/704 [00:55<04:18,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 126/704 [00:55<04:15,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 126/704 [00:56<04:17,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 127/704 [00:56<04:15,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 127/704 [00:56<04:17,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 128/704 [00:56<04:15,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 128/704 [00:57<04:16,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 129/704 [00:57<04:14,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 129/704 [00:57<04:16,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 130/704 [00:57<04:13,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  18%|█▊        | 130/704 [00:57<04:15,  2.24it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▊        | 131/704 [00:57<04:13,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▊        | 131/704 [00:58<04:15,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 132/704 [00:58<04:12,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 132/704 [00:58<04:14,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 133/704 [00:58<04:12,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 133/704 [00:59<04:14,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 134/704 [00:59<04:12,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 134/704 [00:59<04:13,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 135/704 [00:59<04:11,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 135/704 [01:00<04:13,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 136/704 [01:00<04:11,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 136/704 [01:00<04:12,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 137/704 [01:00<04:10,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  19%|█▉        | 137/704 [01:00<04:12,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 138/704 [01:01<04:10,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 138/704 [01:01<04:11,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 139/704 [01:01<04:09,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 139/704 [01:01<04:11,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 140/704 [01:01<04:09,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|█▉        | 140/704 [01:02<04:10,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 141/704 [01:02<04:08,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 141/704 [01:02<04:10,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 142/704 [01:02<04:08,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 142/704 [01:03<04:09,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 143/704 [01:03<04:07,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 143/704 [01:03<04:09,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 144/704 [01:03<04:07,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  20%|██        | 144/704 [01:03<04:08,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 145/704 [01:04<04:06,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 145/704 [01:04<04:08,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 146/704 [01:04<04:06,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 146/704 [01:04<04:07,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 147/704 [01:04<04:05,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 147/704 [01:05<04:07,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 148/704 [01:05<04:05,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 148/704 [01:05<04:06,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 149/704 [01:05<04:04,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██        | 149/704 [01:06<04:06,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██▏       | 150/704 [01:06<04:04,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██▏       | 150/704 [01:06<04:05,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██▏       | 151/704 [01:06<04:04,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  21%|██▏       | 151/704 [01:07<04:05,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 152/704 [01:07<04:03,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 152/704 [01:07<04:04,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 153/704 [01:07<04:03,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 153/704 [01:07<04:04,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 154/704 [01:07<04:02,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 154/704 [01:08<04:03,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 155/704 [01:08<04:02,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 155/704 [01:08<04:03,  2.25it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 156/704 [01:08<04:01,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 156/704 [01:09<04:02,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 157/704 [01:09<04:01,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 157/704 [01:09<04:02,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 158/704 [01:09<04:00,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  22%|██▏       | 158/704 [01:10<04:01,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 159/704 [01:10<04:00,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 159/704 [01:10<04:01,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 160/704 [01:10<03:59,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 160/704 [01:10<04:01,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 161/704 [01:10<03:59,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 161/704 [01:11<04:00,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 162/704 [01:11<03:58,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 162/704 [01:11<04:00,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 163/704 [01:11<03:58,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 163/704 [01:12<03:59,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 164/704 [01:12<03:57,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 164/704 [01:12<03:59,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 165/704 [01:12<03:57,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  23%|██▎       | 165/704 [01:13<03:58,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▎       | 166/704 [01:13<03:56,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▎       | 166/704 [01:13<03:58,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▎       | 167/704 [01:13<03:56,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▎       | 167/704 [01:13<03:57,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 168/704 [01:13<03:55,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 168/704 [01:14<03:57,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 169/704 [01:14<03:55,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 169/704 [01:14<03:56,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 170/704 [01:14<03:55,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 170/704 [01:15<03:56,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 171/704 [01:15<03:54,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 171/704 [01:15<03:55,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 172/704 [01:15<03:54,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  24%|██▍       | 172/704 [01:16<03:55,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 173/704 [01:16<03:53,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 173/704 [01:16<03:54,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 174/704 [01:16<03:53,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 174/704 [01:16<03:54,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 175/704 [01:16<03:52,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▍       | 175/704 [01:17<03:53,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 176/704 [01:17<03:52,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 176/704 [01:17<03:53,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 177/704 [01:17<03:51,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 177/704 [01:18<03:52,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 178/704 [01:18<03:51,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 178/704 [01:18<03:52,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 179/704 [01:18<03:50,  2.27it/s, v_num=11, train_loss=4.600]
Epoch 0:  25%|██▌       | 179/704 [01:19<03:51,  2.26it/s, v_num=11, train_loss=4.600]
Epoch 0:  26%|██▌       | 180/704 [01:19<03:50,  2.27it/s, v_num=11, train_loss=4.600]

Nothing crashes and the model summary looks good, but the training loss just doesn't change (it varies slightly from batch to batch, but not because the model is being trained).

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0): 2.3.3
#- PyTorch Version (e.g., 2.4): 2.3.1
#- Python version (e.g., 3.12): 3.10.14
#- OS (e.g., Linux):  Linux
#- CUDA/cuDNN version: 12.4
#- GPU models and configuration: 3090
#- How you installed Lightning(`conda`, `pip`, source): pip

The collect env script is not working, btw

Traceback (most recent call last):
 
  File "/conda/envs/ai/lib/python3.10/site-packages/pkg_resources/_vendor/pyparsing.py", line 2711, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pkg_resources._vendor.pyparsing.ParseException: Expected W:(abcd...) (at char 0), (line:1, col:1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
    raise InvalidRequirement(
pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse error at "'-cipy==1'": Expected W:(abcd...)

More info

No response

2catycm avatar Oct 15 '24 20:10 2catycm

My full code is a little bit complicated, but I believe the problem lies within the logic above. Did I use Lightning incorrectly in the code above?

2catycm avatar Oct 15 '24 20:10 2catycm

I would be grateful if someone could give me any idea of what may cause such an issue.

I wondered whether cls_model was frozen by mistake, but it is not: its parameters have requires_grad set.

2catycm avatar Oct 15 '24 20:10 2catycm

Maybe it is related to this issue: https://github.com/Lightning-AI/pytorch-lightning/issues/20128

I am also using Hugging Face's AutoModel with from_pretrained, and the model is in eval mode.

I tried manually switching it to training mode, but that does not help.

2catycm avatar Oct 15 '24 20:10 2catycm

No, it is not because of that issue. I double-checked that I have been calling nn.Module.train() ever since I started using AutoModel.from_pretrained.

2catycm avatar Oct 16 '24 04:10 2catycm

To debug, I print the L2 norms of the parameters and the gradients every time training_step is called. Something interesting happens.

Grad Norm: 0.08665306866168976
Params Norm before step: 771.8257446289062
Params Norm after step: 771.8740234375
Grad Norm: 9.2427133654982e-12
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.773968298646178e-11
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.1152222808424872e-12
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 6.962481264270737e-13
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.828729181974076e-11
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375

The optimizer did make a change to the model, which is self, the L.LightningModule instance. However, the gradient somehow becomes very small.

Could any expert kindly tell me where I am using Lightning incorrectly?
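
For readers who want to collect the same diagnostics, here is a minimal sketch of how such norms could be computed in plain PyTorch. The helper names (`l2_norm`, `param_norm`, `grad_norm`) and the tiny `nn.Linear` model are hypothetical stand-ins, not the reporter's actual code.

```python
import torch
import torch.nn as nn


def l2_norm(tensors) -> float:
    """Global L2 norm over an iterable of tensors (0.0 if it is empty)."""
    flat = [t.detach().flatten() for t in tensors]
    return torch.cat(flat).norm(2).item() if flat else 0.0


def param_norm(model: nn.Module) -> float:
    """L2 norm over all parameters of the model."""
    return l2_norm(p for p in model.parameters())


def grad_norm(model: nn.Module) -> float:
    """L2 norm over all gradients that have been populated by backward()."""
    return l2_norm(p.grad for p in model.parameters() if p.grad is not None)


# Hypothetical toy setup, only to show where the calls would go.
torch.manual_seed(0)
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(8, 4), torch.randint(0, 3, (8,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()

g_norm = grad_norm(model)
norm_before = param_norm(model)
optimizer.step()
norm_after = param_norm(model)

print(f"Grad Norm: {g_norm}")
print(f"Params Norm before step: {norm_before}")
print(f"Params Norm after step: {norm_after}")
```

A gradient norm that collapses to around 1e-12 right after the first step, as in the log above, suggests that something in the forward pass has saturated rather than that the optimizer is being skipped.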

2catycm avatar Oct 16 '24 04:10 2catycm

Hi,

I am not an expert, but I checked your code and you seem to call loss.backward() instead of self.manual_backward(loss) as stated in the documentation (https://lightning.ai/docs/pytorch/stable/model/manual_optimization.html#manual-optimization).

Can you see if this helps?

arijit-hub avatar Oct 25 '24 00:10 arijit-hub

I think it is fair to conclude that the issue is not with Lightning here. The gradients are being correctly updated and backpropagated; however, they are very small. I am going to assume that you are experiencing some kind of vanishing gradient problem due to the model being used. Please make sure that:

  • you call .train() on your models before calling trainer.fit
  • if you are doing manual optimization, you call self.manual_backward
  • you do not feed the output of nn.Softmax into nn.CrossEntropyLoss. I noticed that you are using the two in combination, which is not correct: cross entropy loss expects the input tensor to be logits, not probabilities.
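
To illustrate the last point, here is a small sketch in plain PyTorch comparing the gradient of `nn.CrossEntropyLoss` on raw logits with the gradient when the logits are first passed through softmax. The 100 classes, the batch of 8, and the logit value of 50 (standing in for a model that has become confidently wrong after an oversized update) are all assumptions made for the demonstration, not values taken from the reporter's setup.

```python
import torch
import torch.nn as nn

num_classes = 100  # assumption: the thread does not state the class count
batch_size = 8     # assumption

targets = torch.arange(batch_size)

# Saturated logits: every sample is confidently assigned to a wrong class.
logits = torch.zeros(batch_size, num_classes)
logits[torch.arange(batch_size), targets + 1] = 50.0
logits.requires_grad_()

loss_fn = nn.CrossEntropyLoss()

# Correct usage: the loss applies log-softmax internally.
loss_ok = loss_fn(logits, targets)
(grad_ok,) = torch.autograd.grad(loss_ok, logits)

# Incorrect usage: softmax is applied twice, once here and once in the loss.
loss_bad = loss_fn(torch.softmax(logits, dim=1), targets)
(grad_bad,) = torch.autograd.grad(loss_bad, logits)

grad_ok_norm = grad_ok.norm().item()
grad_bad_norm = grad_bad.norm().item()

print(f"logits  -> loss {loss_ok.item():.3f}, grad norm {grad_ok_norm:.3e}")
print(f"softmax -> loss {loss_bad.item():.3f}, grad norm {grad_bad_norm:.3e}")
```

With raw logits the confidently wrong predictions produce a large loss and a usable gradient. With the extra softmax, the loss input is confined to [0, 1], so the loss lands close to ln(100) ≈ 4.605, and the saturated softmax passes back almost no gradient. That is in the same neighborhood as the 4.600 shown in the progress bar, though whether the reporter's dataset has roughly 100 classes is not stated in the thread.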

Closing issue, but feel free to ping and reopen if necessary. We are probably going to need a fully reproducible example to be able to help more.

SkafteNicki avatar Sep 13 '25 11:09 SkafteNicki