Everything prints fine, but the loss doesn't decrease
Bug description
Even after I set the learning rate to 1, and even 100, the loss doesn't change at all; it is always 4.60. I stepped through the code to see what happens, but everything seems to work: the loss is backpropagated successfully, the gradients of each parameter look fine, and the optimizer is indeed called.
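For reference, 4.60 is very close to ln(100) ≈ 4.605, which is the cross-entropy of a uniform prediction over 100 classes (the head in the model summary has 100 outputs). The following is a minimal, framework-free sketch of that arithmetic; `uniform_cross_entropy` is a hypothetical helper name used only for illustration and is not part of the reported code.

```python
import math


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) when every class gets probability 1/num_classes."""
    return -math.log(1.0 / num_classes)


# 100 classes, matching the head's output size in the model summary.
chance_loss = uniform_cross_entropy(100)
print(f"{chance_loss:.3f}")  # 4.605
```

A logged value that stays at this level suggests the predictions the loss sees are effectively uniform, whether or not the weights are being updated.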
What version are you seeing the problem on?
v2.3
How to reproduce the bug
class ClassificationTask(L.LightningModule):
    def __init__(self, config: ClassificationTaskConfig) -> None:
        super().__init__()
        self.save_hyperparameters(config.model_dump())
        L.seed_everything(config.experiment_index)  # use index as the seed for reproducibility
        self.lit_data: ClassificationDataModule = config.dataset_config.get_lightning_data_module()
        config.cls_model_config.num_of_classes = self.lit_data.num_of_classes
        self.cls_model: HuggingfaceModel = config.cls_model_config.get_cls_model()
        self.lit_data.set_transform_from_hf_image_preprocessor(hf_image_preprocessor=self.cls_model.image_preprocessor)
        model_image_size: tuple[int, int] = (self.cls_model.image_preprocessor.size['height'], self.cls_model.image_preprocessor.size['width'])
        self.example_input_array = torch.Tensor(1, self.cls_model.backbone.config.num_channels, *model_image_size)
        self.softmax = nn.Softmax(dim=1)
        self.loss = nn.CrossEntropyLoss(label_smoothing=config.label_smoothing)
        self.automatic_optimization = False  # The problem occurs when True, so I tried to use False to see what happens

    def compute_model_logits(self, image_tensor: torch.Tensor) -> torch.Tensor:
        return self.cls_model(image_tensor)

    @override
    def forward(self, image_tensor: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        return self.softmax(self.compute_model_logits(image_tensor))

    def forward_loss(self, image_tensor: torch.Tensor, label_tensor: torch.Tensor) -> torch.Tensor:
        probs = self(image_tensor)
        # return F.nll_loss(logits, label_tensor)
        return self.loss(probs, label_tensor)

    @override
    def training_step(self, batch, batch_idx=None, *args, **kwargs) -> STEP_OUTPUT:
        self.train()
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.forward_loss(*batch)
        self.log("train_loss", loss, prog_bar=True)
        # self.manual_backward(loss)
        loss.backward()
        opt.step()
        return loss

    @override
    def configure_optimizers(self) -> OptimizerLRScheduler:
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.learning_rate)
from .core import ClassificationTask, ClassificationTaskConfig
config = ClassificationTaskConfig()
config.learning_rate = 3e-4 # doesn't work
config.learning_rate = 1000 # should expect a NaN if it is optimizing, try to debug
config.dataset_config.batch_size = 64
cls_task = ClassificationTask(config)
import lightning as L
from .utils import runs_path
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from lightning.pytorch.callbacks import ModelSummary, StochasticWeightAveraging, DeviceStatsMonitor
from lightning.pytorch.loggers import TensorBoardLogger, CSVLogger
trainer = L.Trainer(
    default_root_dir=runs_path,
    enable_checkpointing=True,
    enable_model_summary=True,
    num_sanity_val_steps=2,
    callbacks=[
        EarlyStopping(
            monitor="val_acc1",
            mode="max",
            check_finite=True,
            patience=5,
            check_on_train_epoch_end=False,  # check on validation end
            verbose=True,
        ),
        ModelSummary(max_depth=3),
        DeviceStatsMonitor(cpu_stats=True),
    ],
    logger=[TensorBoardLogger(save_dir=runs_path / "tensorboard"), CSVLogger(save_dir=runs_path)],
)
trainer.fit(cls_task, datamodule=cls_task.lit_data)
Error messages and logs
root
└── cls_model (HuggingfaceModel)
    ├── backbone (ViTModel)
    │   ├── embeddings (ViTEmbeddings) cls_token:[1, 1, 768] position_embeddings:[1, 197, 768]
    │   │   └── patch_embeddings (ViTPatchEmbeddings)
    │   │       └── projection (Conv2d) weight:[768, 3, 16, 16] bias:[768]
    │   ├── encoder (ViTEncoder)
    │   │   └── layer (ModuleList)
    │   │       └── 0-11(ViTLayer)
    │   │           ├── attention (ViTAttention)
    │   │           │   ├── attention (ViTSelfAttention)
    │   │           │   │   └── query,key,value(Linear) weight:[768, 768] bias:[768]
    │   │           │   └── output (ViTSelfOutput)
    │   │           │       └── dense (Linear) weight:[768, 768] bias:[768]
    │   │           ├── intermediate (ViTIntermediate)
    │   │           │   └── dense (Linear) weight:[3072, 768] bias:[3072]
    │   │           ├── output (ViTOutput)
    │   │           │   └── dense (Linear) weight:[768, 3072] bias:[768]
    │   │           └── layernorm_before,layernorm_after(LayerNorm) weight:[768] bias:[768]
    │   ├── layernorm (LayerNorm) weight:[768] bias:[768]
    │   └── pooler (ViTPooler)
    │       └── dense (Linear) weight:[768, 768] bias:[768]
    └── head (Linear) weight:[100, 768] bias:[100]
Files already downloaded and verified
Files already downloaded and verified
202
Sanity Checking: | | 0/? [00:00<?, ?it/s]
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 50%|█████ | 1/2 [00:00<00:00, 1.78it/s]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 2.78it/s]
Training: | | 0/? [00:00<?, ?it/s]
Training: 0%| | 0/704 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/704 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/704 [00:02<30:06, 0.39it/s]
Epoch 0: 0%| | 1/704 [00:02<30:07, 0.39it/s, v_num=11, train_loss=4.610]
Epoch 0: 0%| | 2/704 [00:03<17:54, 0.65it/s, v_num=11, train_loss=4.610]
Epoch 0: 0%| | 2/704 [00:03<17:55, 0.65it/s, v_num=11, train_loss=4.610]
Epoch 0: 0%| | 3/704 [00:03<13:59, 0.84it/s, v_num=11, train_loss=4.610]
Epoch 0: 0%| | 3/704 [00:03<14:01, 0.83it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 4/704 [00:03<11:26, 1.02it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 4/704 [00:04<11:49, 0.99it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%| | 5/704 [00:04<09:31, 1.22it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%| | 5/704 [00:04<10:25, 1.12it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 6/704 [00:04<08:46, 1.33it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 6/704 [00:04<09:30, 1.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 7/704 [00:04<08:11, 1.42it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%| | 7/704 [00:05<08:50, 1.31it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%| | 8/704 [00:05<07:52, 1.47it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%| | 8/704 [00:05<08:22, 1.39it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%|▏ | 9/704 [00:05<07:35, 1.53it/s, v_num=11, train_loss=4.600]
Epoch 0: 1%|▏ | 9/704 [00:06<07:58, 1.45it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%|▏ | 10/704 [00:06<07:18, 1.58it/s, v_num=11, train_loss=4.610]
Epoch 0: 1%|▏ | 10/704 [00:06<07:39, 1.51it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 11/704 [00:06<06:59, 1.65it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 11/704 [00:07<07:23, 1.56it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 12/704 [00:07<06:48, 1.70it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 12/704 [00:07<07:10, 1.61it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 13/704 [00:07<06:39, 1.73it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 13/704 [00:07<06:59, 1.65it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 14/704 [00:07<06:30, 1.77it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 14/704 [00:08<06:49, 1.68it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 15/704 [00:08<06:23, 1.80it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 15/704 [00:08<06:41, 1.72it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 16/704 [00:08<06:16, 1.83it/s, v_num=11, train_loss=4.600]
Epoch 0: 2%|▏ | 16/704 [00:09<06:33, 1.75it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 17/704 [00:09<06:11, 1.85it/s, v_num=11, train_loss=4.610]
Epoch 0: 2%|▏ | 17/704 [00:09<06:27, 1.77it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 18/704 [00:09<06:06, 1.87it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 18/704 [00:10<06:21, 1.80it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 19/704 [00:10<06:02, 1.89it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 19/704 [00:10<06:15, 1.82it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 20/704 [00:10<05:57, 1.91it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 20/704 [00:10<06:10, 1.84it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 21/704 [00:10<05:53, 1.93it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 21/704 [00:11<06:06, 1.86it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 22/704 [00:11<05:50, 1.95it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 22/704 [00:11<06:02, 1.88it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 23/704 [00:11<05:48, 1.95it/s, v_num=11, train_loss=4.610]
Epoch 0: 3%|▎ | 23/704 [00:12<05:58, 1.90it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 24/704 [00:12<05:44, 1.97it/s, v_num=11, train_loss=4.600]
Epoch 0: 3%|▎ | 24/704 [00:12<05:55, 1.91it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▎ | 25/704 [00:12<05:41, 1.99it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▎ | 25/704 [00:12<05:52, 1.93it/s, v_num=11, train_loss=4.610]
Epoch 0: 4%|▎ | 26/704 [00:13<05:39, 2.00it/s, v_num=11, train_loss=4.610]
Epoch 0: 4%|▎ | 26/704 [00:13<05:49, 1.94it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 27/704 [00:13<05:36, 2.01it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 27/704 [00:13<05:46, 1.95it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 28/704 [00:13<05:34, 2.02it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 28/704 [00:14<05:43, 1.97it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 29/704 [00:14<05:32, 2.03it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 29/704 [00:14<05:41, 1.98it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 30/704 [00:14<05:30, 2.04it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 30/704 [00:15<05:39, 1.99it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 31/704 [00:15<05:30, 2.03it/s, v_num=11, train_loss=4.600]
Epoch 0: 4%|▍ | 31/704 [00:15<05:36, 2.00it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 32/704 [00:15<05:27, 2.05it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 32/704 [00:15<05:34, 2.01it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 33/704 [00:15<05:24, 2.07it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 33/704 [00:16<05:32, 2.02it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 34/704 [00:16<05:23, 2.07it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▍ | 34/704 [00:16<05:30, 2.03it/s, v_num=11, train_loss=4.610]
Epoch 0: 5%|▍ | 35/704 [00:16<05:21, 2.08it/s, v_num=11, train_loss=4.610]
Epoch 0: 5%|▍ | 35/704 [00:17<05:29, 2.03it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 36/704 [00:17<05:20, 2.09it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 36/704 [00:17<05:27, 2.04it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 37/704 [00:17<05:18, 2.09it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 37/704 [00:18<05:25, 2.05it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 38/704 [00:18<05:17, 2.10it/s, v_num=11, train_loss=4.600]
Epoch 0: 5%|▌ | 38/704 [00:18<05:24, 2.06it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 39/704 [00:18<05:15, 2.11it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 39/704 [00:18<05:22, 2.06it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 40/704 [00:18<05:15, 2.11it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 40/704 [00:19<05:21, 2.07it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 41/704 [00:19<05:13, 2.12it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 41/704 [00:19<05:19, 2.07it/s, v_num=11, train_loss=4.610]
Epoch 0: 6%|▌ | 42/704 [00:19<05:12, 2.12it/s, v_num=11, train_loss=4.610]
Epoch 0: 6%|▌ | 42/704 [00:20<05:18, 2.08it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 43/704 [00:20<05:10, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▌ | 43/704 [00:20<05:16, 2.09it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▋ | 44/704 [00:20<05:09, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▋ | 44/704 [00:21<05:15, 2.09it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▋ | 45/704 [00:21<05:09, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 6%|▋ | 45/704 [00:21<05:14, 2.10it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 46/704 [00:21<05:07, 2.14it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 46/704 [00:21<05:13, 2.10it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 47/704 [00:21<05:06, 2.14it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 47/704 [00:22<05:12, 2.11it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 48/704 [00:22<05:05, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 48/704 [00:22<05:10, 2.11it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 49/704 [00:22<05:05, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 49/704 [00:23<05:09, 2.11it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 50/704 [00:23<05:04, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 50/704 [00:23<05:08, 2.12it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 51/704 [00:23<05:03, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 51/704 [00:24<05:08, 2.12it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 52/704 [00:24<05:02, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 7%|▋ | 52/704 [00:24<05:07, 2.12it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 53/704 [00:24<05:01, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 53/704 [00:24<05:06, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 54/704 [00:24<05:00, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 54/704 [00:25<05:05, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 55/704 [00:25<04:59, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 55/704 [00:25<05:04, 2.13it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 56/704 [00:25<04:58, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 56/704 [00:26<05:03, 2.14it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 57/704 [00:26<04:57, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 57/704 [00:26<05:02, 2.14it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 58/704 [00:26<04:57, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 58/704 [00:27<05:01, 2.14it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 59/704 [00:27<04:56, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 8%|▊ | 59/704 [00:27<05:00, 2.15it/s, v_num=11, train_loss=4.610]
Epoch 0: 9%|▊ | 60/704 [00:27<04:55, 2.18it/s, v_num=11, train_loss=4.610]
Epoch 0: 9%|▊ | 60/704 [00:27<04:59, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▊ | 61/704 [00:27<04:54, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▊ | 61/704 [00:28<04:58, 2.15it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 62/704 [00:28<04:53, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 62/704 [00:28<04:57, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 63/704 [00:28<04:53, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 63/704 [00:29<04:56, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 64/704 [00:29<04:52, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 64/704 [00:29<04:56, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 65/704 [00:29<04:51, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 65/704 [00:30<04:55, 2.16it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 66/704 [00:30<04:50, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 9%|▉ | 66/704 [00:30<04:54, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 67/704 [00:30<04:50, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 67/704 [00:30<04:53, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 68/704 [00:30<04:49, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 68/704 [00:31<04:52, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 69/704 [00:31<04:48, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 69/704 [00:31<04:52, 2.17it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 70/704 [00:31<04:47, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|▉ | 70/704 [00:32<04:51, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 71/704 [00:32<04:47, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 71/704 [00:32<04:50, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 72/704 [00:32<04:46, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 72/704 [00:33<04:49, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 73/704 [00:33<04:45, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 10%|█ | 73/704 [00:33<04:49, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 74/704 [00:33<04:44, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 74/704 [00:33<04:48, 2.18it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 75/704 [00:33<04:44, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 75/704 [00:34<04:47, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 76/704 [00:34<04:44, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 76/704 [00:34<04:46, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 77/704 [00:34<04:43, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 77/704 [00:35<04:46, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 78/704 [00:35<04:42, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 78/704 [00:35<04:45, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 79/704 [00:35<04:42, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█ | 79/704 [00:36<04:44, 2.19it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█▏ | 80/704 [00:36<04:41, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 11%|█▏ | 80/704 [00:36<04:44, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 81/704 [00:36<04:40, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 81/704 [00:36<04:43, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 82/704 [00:36<04:39, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 82/704 [00:37<04:42, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 83/704 [00:37<04:39, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 83/704 [00:37<04:42, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 84/704 [00:37<04:38, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 84/704 [00:38<04:41, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 85/704 [00:38<04:38, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 85/704 [00:38<04:40, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 86/704 [00:38<04:37, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 86/704 [00:39<04:40, 2.20it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 87/704 [00:39<04:36, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▏ | 87/704 [00:39<04:39, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▎ | 88/704 [00:39<04:36, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 12%|█▎ | 88/704 [00:39<04:39, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 89/704 [00:39<04:36, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 89/704 [00:40<04:38, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 90/704 [00:40<04:35, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 90/704 [00:40<04:37, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 91/704 [00:40<04:34, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 91/704 [00:41<04:37, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 92/704 [00:41<04:34, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 92/704 [00:41<04:36, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 93/704 [00:41<04:33, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 93/704 [00:42<04:35, 2.21it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 94/704 [00:42<04:32, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 94/704 [00:42<04:35, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 95/704 [00:42<04:32, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 13%|█▎ | 95/704 [00:42<04:34, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▎ | 96/704 [00:42<04:31, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▎ | 96/704 [00:43<04:34, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 97/704 [00:43<04:31, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 97/704 [00:43<04:33, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 98/704 [00:43<04:30, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 98/704 [00:44<04:32, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 99/704 [00:44<04:30, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 99/704 [00:44<04:32, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 100/704 [00:44<04:29, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 100/704 [00:45<04:32, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 101/704 [00:45<04:29, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 101/704 [00:45<04:31, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 102/704 [00:45<04:28, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 14%|█▍ | 102/704 [00:45<04:30, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 103/704 [00:45<04:28, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 103/704 [00:46<04:30, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 104/704 [00:46<04:27, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 104/704 [00:46<04:29, 2.22it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 105/704 [00:46<04:26, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▍ | 105/704 [00:47<04:29, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 106/704 [00:47<04:26, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 106/704 [00:47<04:28, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 107/704 [00:47<04:25, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 107/704 [00:48<04:28, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 108/704 [00:48<04:25, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 108/704 [00:48<04:27, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 109/704 [00:48<04:24, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 15%|█▌ | 109/704 [00:48<04:26, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 110/704 [00:48<04:24, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 110/704 [00:49<04:26, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 111/704 [00:49<04:23, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 111/704 [00:49<04:25, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 112/704 [00:49<04:23, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 112/704 [00:50<04:25, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 113/704 [00:50<04:22, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 113/704 [00:50<04:24, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 114/704 [00:50<04:22, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▌ | 114/704 [00:51<04:24, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▋ | 115/704 [00:51<04:21, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▋ | 115/704 [00:51<04:23, 2.23it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▋ | 116/704 [00:51<04:21, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 16%|█▋ | 116/704 [00:51<04:23, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 117/704 [00:51<04:20, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 117/704 [00:52<04:22, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 118/704 [00:52<04:20, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 118/704 [00:52<04:21, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 119/704 [00:52<04:19, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 119/704 [00:53<04:21, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 120/704 [00:53<04:19, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 120/704 [00:53<04:20, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 121/704 [00:53<04:18, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 121/704 [00:54<04:20, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 122/704 [00:54<04:18, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 122/704 [00:54<04:19, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 123/704 [00:54<04:17, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 17%|█▋ | 123/704 [00:54<04:19, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 124/704 [00:54<04:17, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 124/704 [00:55<04:18, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 125/704 [00:55<04:16, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 125/704 [00:55<04:18, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 126/704 [00:55<04:15, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 126/704 [00:56<04:17, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 127/704 [00:56<04:15, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 127/704 [00:56<04:17, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 128/704 [00:56<04:15, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 128/704 [00:57<04:16, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 129/704 [00:57<04:14, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 129/704 [00:57<04:16, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 130/704 [00:57<04:13, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 18%|█▊ | 130/704 [00:57<04:15, 2.24it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▊ | 131/704 [00:57<04:13, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▊ | 131/704 [00:58<04:15, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 132/704 [00:58<04:12, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 132/704 [00:58<04:14, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 133/704 [00:58<04:12, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 133/704 [00:59<04:14, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 134/704 [00:59<04:12, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 134/704 [00:59<04:13, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 135/704 [00:59<04:11, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 135/704 [01:00<04:13, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 136/704 [01:00<04:11, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 136/704 [01:00<04:12, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 137/704 [01:00<04:10, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 19%|█▉ | 137/704 [01:00<04:12, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 138/704 [01:01<04:10, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 138/704 [01:01<04:11, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 139/704 [01:01<04:09, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 139/704 [01:01<04:11, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 140/704 [01:01<04:09, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|█▉ | 140/704 [01:02<04:10, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 141/704 [01:02<04:08, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 141/704 [01:02<04:10, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 142/704 [01:02<04:08, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 142/704 [01:03<04:09, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 143/704 [01:03<04:07, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 143/704 [01:03<04:09, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 144/704 [01:03<04:07, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 20%|██ | 144/704 [01:03<04:08, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 145/704 [01:04<04:06, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 145/704 [01:04<04:08, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 146/704 [01:04<04:06, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 146/704 [01:04<04:07, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 147/704 [01:04<04:05, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 147/704 [01:05<04:07, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 148/704 [01:05<04:05, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 148/704 [01:05<04:06, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 149/704 [01:05<04:04, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██ | 149/704 [01:06<04:06, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██▏ | 150/704 [01:06<04:04, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██▏ | 150/704 [01:06<04:05, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██▏ | 151/704 [01:06<04:04, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 21%|██▏ | 151/704 [01:07<04:05, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 152/704 [01:07<04:03, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 152/704 [01:07<04:04, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 153/704 [01:07<04:03, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 153/704 [01:07<04:04, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 154/704 [01:07<04:02, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 154/704 [01:08<04:03, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 155/704 [01:08<04:02, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 155/704 [01:08<04:03, 2.25it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 156/704 [01:08<04:01, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 156/704 [01:09<04:02, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 157/704 [01:09<04:01, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 157/704 [01:09<04:02, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 158/704 [01:09<04:00, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 22%|██▏ | 158/704 [01:10<04:01, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 159/704 [01:10<04:00, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 159/704 [01:10<04:01, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 160/704 [01:10<03:59, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 160/704 [01:10<04:01, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 161/704 [01:10<03:59, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 161/704 [01:11<04:00, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 162/704 [01:11<03:58, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 162/704 [01:11<04:00, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 163/704 [01:11<03:58, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 163/704 [01:12<03:59, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 164/704 [01:12<03:57, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 164/704 [01:12<03:59, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 165/704 [01:12<03:57, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 23%|██▎ | 165/704 [01:13<03:58, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▎ | 166/704 [01:13<03:56, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▎ | 166/704 [01:13<03:58, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▎ | 167/704 [01:13<03:56, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▎ | 167/704 [01:13<03:57, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 168/704 [01:13<03:55, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 168/704 [01:14<03:57, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 169/704 [01:14<03:55, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 169/704 [01:14<03:56, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 170/704 [01:14<03:55, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 170/704 [01:15<03:56, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 171/704 [01:15<03:54, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 171/704 [01:15<03:55, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 172/704 [01:15<03:54, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 24%|██▍ | 172/704 [01:16<03:55, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 173/704 [01:16<03:53, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 173/704 [01:16<03:54, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 174/704 [01:16<03:53, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 174/704 [01:16<03:54, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 175/704 [01:16<03:52, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▍ | 175/704 [01:17<03:53, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 176/704 [01:17<03:52, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 176/704 [01:17<03:53, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 177/704 [01:17<03:51, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 177/704 [01:18<03:52, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 178/704 [01:18<03:51, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 178/704 [01:18<03:52, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 179/704 [01:18<03:50, 2.27it/s, v_num=11, train_loss=4.600]
Epoch 0: 25%|██▌ | 179/704 [01:19<03:51, 2.26it/s, v_num=11, train_loss=4.600]
Epoch 0: 26%|██▌ | 180/704 [01:19<03:50, 2.27it/s, v_num=11, train_loss=4.600]
Nothing crashes and the model summary looks good, but the training loss just doesn't change (it varies slightly from batch to batch, but not because the model is training).
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0): 2.3.3
#- PyTorch Version (e.g., 2.4): 2.3.1
#- Python version (e.g., 3.12): 3.10.14
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: 12.4
#- GPU models and configuration: 3090
#- How you installed Lightning(`conda`, `pip`, source): pip
The collect env script is not working, by the way:
Traceback (most recent call last):
File "/conda/envs/ai/lib/python3.10/site-packages/pkg_resources/_vendor/pyparsing.py", line 2711, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pkg_resources._vendor.pyparsing.ParseException: Expected W:(abcd...) (at char 0), (line:1, col:1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
raise InvalidRequirement(
pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse error at "'-cipy==1'": Expected W:(abcd...)
More info
No response
My full code is a little complicated, but I believe the problem lies within the logic above. Did I use Lightning incorrectly in that code?
I would be grateful if someone could give me any idea about what may cause this issue.
I wondered whether the cls_model had been frozen by mistake, but it has not: its parameters all have requires_grad set.
Maybe it is related to this issue: https://github.com/Lightning-AI/pytorch-lightning/issues/20128
I am also using Hugging Face's AutoModel from_pretrained, and the model is in eval mode.
I tried calling train() manually, but that did not help.
No, it is not because of that issue. I double-checked that I call nn.Module.train() after loading the model with AutoModel.from_pretrained.
To debug, I printed the L2 norm of the parameters and of the gradients every time training_step is called. Something interesting happens.
Grad Norm: 0.08665306866168976
Params Norm before step: 771.8257446289062
Params Norm after step: 771.8740234375
Grad Norm: 9.2427133654982e-12
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.773968298646178e-11
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.1152222808424872e-12
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 6.962481264270737e-13
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
Grad Norm: 1.828729181974076e-11
Params Norm before step: 771.8740234375
Params Norm after step: 771.8740234375
The optimizer does change the model (self, the L.LightningModule instance) on the first step. After that, however, the gradient norm collapses to around 1e-11 or smaller and the parameter norm stops moving.
Can anyone tell me where I am using Lightning incorrectly?
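For anyone who wants to reproduce this kind of printout, here is a minimal sketch of how the norms could be computed. The helper names (global_l2_norm, param_norm, grad_norm) and the tiny nn.Linear stand-in model are hypothetical and are not taken from the code in this issue; only standard PyTorch calls are used.

```python
import torch
from torch import nn


def global_l2_norm(tensors):
    """L2 norm over all elements of all given tensors (0.0 if there are none)."""
    tensors = list(tensors)
    if not tensors:
        return 0.0
    squared_sum = sum(t.detach().float().pow(2).sum() for t in tensors)
    return torch.sqrt(squared_sum).item()


def param_norm(model):
    """Global L2 norm of the model's parameters."""
    return global_l2_norm(p for p in model.parameters())


def grad_norm(model):
    """Global L2 norm of the gradients that are currently populated."""
    return global_l2_norm(p.grad for p in model.parameters() if p.grad is not None)


# Hypothetical stand-in for the real model, just to exercise the helpers.
torch.manual_seed(0)
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)  # logits are passed directly
loss.backward()

g = grad_norm(model)
before = param_norm(model)
optimizer.step()
after = param_norm(model)

print(f"Grad Norm: {g}")
print(f"Params Norm before step: {before}")
print(f"Params Norm after step: {after}")
```

In a LightningModule the same helpers could be called on self inside training_step, after the backward call and around the optimizer step.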
Hey,
I'm not an expert, but I checked your code and you seem to call loss.backward() instead of self.manual_backward(loss) as stated in the documentation (https://lightning.ai/docs/pytorch/stable/model/manual_optimization.html#manual-optimization).
Can you see if this helps?
I think it is fair to conclude that the issue is not with Lightning here. The gradients are being computed and backpropagated correctly, but they are very small. I am going to assume that you are experiencing some kind of vanishing gradient problem due to the model being used. Please make sure that:
- you call .train() on models before calling trainer.fit
- if you are doing manual optimization, you call self.manual_backward
- you pass logits to the loss. I noticed that you are using nn.Softmax in combination with nn.CrossEntropyLoss, which is not correct. Cross entropy loss expects the input tensor to be logits, not probabilities.
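To make the last point concrete, here is a small worked example in plain Python (no PyTorch, so the arithmetic is visible). It is only a sketch: the class count of 100 is an assumption inferred from the reported loss of 4.60, which is close to ln(100), label smoothing is ignored, and the functions cross_entropy and softmax are hand-written stand-ins rather than the library implementations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross entropy that treats `scores` as logits: log-sum-exp minus the target score."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


NUM_CLASSES = 100  # assumption: not stated in the issue, inferred from loss ~ ln(100)
target = 0

# A confident and correct prediction: a large logit on the target class.
confident_logits = [20.0] + [0.0] * (NUM_CLASSES - 1)
# A completely uninformed prediction: all logits equal.
uniform_logits = [0.0] * NUM_CLASSES

# Logits passed to the loss directly, as cross entropy expects.
loss_on_logits = cross_entropy(confident_logits, target)

# Probabilities passed to the loss, i.e. softmax applied twice overall.
loss_on_probs_confident = cross_entropy(softmax(confident_logits), target)
loss_on_probs_uniform = cross_entropy(softmax(uniform_logits), target)

print(f"confident, logits in:        {loss_on_logits:.6f}")
print(f"confident, probabilities in: {loss_on_probs_confident:.4f}")
print(f"uniform, probabilities in:   {loss_on_probs_uniform:.4f}")
print(f"ln(NUM_CLASSES):             {math.log(NUM_CLASSES):.4f}")
```

Because probabilities lie in [0, 1], feeding them to a loss that applies softmax again confines the loss to a narrow band just below ln(NUM_CLASSES), even for a perfect prediction. Under these assumptions that is consistent with a loss that sits near 4.60 regardless of the learning rate.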
Closing issue, but feel free to ping and reopen if necessary. We are probably going to need a fully reproducible example to be able to help more.