Stable-Pix2Seq

CUDA Out-of-memory using V100

Open allanj opened this issue 1 year ago • 1 comments

I'm using V100 GPUs for my experiments, but training still runs out of memory partway through the first epoch (around iteration 1330 of 3696). I'm not sure what the cause is at the moment.


Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='./coco2017/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0005, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/tiger/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:09<00:00, 10.3MB/s]
number of params: 36104659
loading annotations into memory...
Done (t=13.57s)
creating index...
index created!
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Start training
Epoch: [0]  [   0/3696]  eta: 2:32:25  lr: 0.000100  loss: 7.6000 (7.6000)  at: 7.6000 (7.6000)  at_unscaled: 7.6000 (7.6000)  time: 2.4743  data: 0.5030  max mem: 14737
Epoch: [0]  [  10/3696]  eta: 0:59:14  lr: 0.000100  loss: 7.5261 (7.5307)  at: 7.5261 (7.5307)  at_unscaled: 7.5261 (7.5307)  time: 0.9643  data: 0.0806  max mem: 25656
Epoch: [0]  [  20/3696]  eta: 0:56:49  lr: 0.000100  loss: 7.4746 (7.4774)  at: 7.4746 (7.4774)  at_unscaled: 7.4746 (7.4774)  time: 0.8501  data: 0.0390  max mem: 25656
Epoch: [0]  [  30/3696]  eta: 0:54:22  lr: 0.000100  loss: 7.3449 (7.4215)  at: 7.3449 (7.4215)  at_unscaled: 7.3449 (7.4215)  time: 0.8489  data: 0.0374  max mem: 25656
Epoch: [0]  [  40/3696]  eta: 0:54:59  lr: 0.000100  loss: 7.2054 (7.3429)  at: 7.2054 (7.3429)  at_unscaled: 7.2054 (7.3429)  time: 0.8761  data: 0.0356  max mem: 25656
Epoch: [0]  [  50/3696]  eta: 0:53:30  lr: 0.000100  loss: 7.0288 (7.2657)  at: 7.0288 (7.2657)  at_unscaled: 7.0288 (7.2657)  time: 0.8662  data: 0.0362  max mem: 25656
Epoch: [0]  [  60/3696]  eta: 0:53:44  lr: 0.000100  loss: 6.8423 (7.1774)  at: 6.8423 (7.1774)  at_unscaled: 6.8423 (7.1774)  time: 0.8553  data: 0.0368  max mem: 26623
Epoch: [0]  [  70/3696]  eta: 0:53:36  lr: 0.000100  loss: 6.6867 (7.0967)  at: 6.6867 (7.0967)  at_unscaled: 6.6867 (7.0967)  time: 0.9036  data: 0.0359  max mem: 26623
Epoch: [0]  [  80/3696]  eta: 0:52:42  lr: 0.000100  loss: 6.5043 (7.0184)  at: 6.5043 (7.0184)  at_unscaled: 6.5043 (7.0184)  time: 0.8368  data: 0.0351  max mem: 26623
Epoch: [0]  [  90/3696]  eta: 0:52:17  lr: 0.000100  loss: 6.4531 (6.9577)  at: 6.4531 (6.9577)  at_unscaled: 6.4531 (6.9577)  time: 0.8094  data: 0.0362  max mem: 26623
Epoch: [0]  [ 100/3696]  eta: 0:51:33  lr: 0.000100  loss: 6.4151 (6.8982)  at: 6.4151 (6.8982)  at_unscaled: 6.4151 (6.8982)  time: 0.8019  data: 0.0386  max mem: 26623
Epoch: [0]  [ 110/3696]  eta: 0:51:10  lr: 0.000100  loss: 6.3319 (6.8437)  at: 6.3319 (6.8437)  at_unscaled: 6.3319 (6.8437)  time: 0.7937  data: 0.0392  max mem: 26623
Epoch: [0]  [ 120/3696]  eta: 0:50:56  lr: 0.000100  loss: 6.2714 (6.7969)  at: 6.2714 (6.7969)  at_unscaled: 6.2714 (6.7969)  time: 0.8268  data: 0.0377  max mem: 26623
Epoch: [0]  [ 130/3696]  eta: 0:50:36  lr: 0.000100  loss: 6.2584 (6.7519)  at: 6.2584 (6.7519)  at_unscaled: 6.2584 (6.7519)  time: 0.8254  data: 0.0372  max mem: 26623
Epoch: [0]  [ 140/3696]  eta: 0:50:25  lr: 0.000100  loss: 6.2035 (6.7111)  at: 6.2035 (6.7111)  at_unscaled: 6.2035 (6.7111)  time: 0.8266  data: 0.0372  max mem: 29528
Epoch: [0]  [ 150/3696]  eta: 0:49:55  lr: 0.000100  loss: 6.1476 (6.6716)  at: 6.1476 (6.6716)  at_unscaled: 6.1476 (6.6716)  time: 0.8011  data: 0.0375  max mem: 29528
Epoch: [0]  [ 160/3696]  eta: 0:49:27  lr: 0.000100  loss: 6.0711 (6.6330)  at: 6.0711 (6.6330)  at_unscaled: 6.0711 (6.6330)  time: 0.7585  data: 0.0372  max mem: 29528
Epoch: [0]  [ 170/3696]  eta: 0:49:10  lr: 0.000100  loss: 6.0247 (6.5969)  at: 6.0247 (6.5969)  at_unscaled: 6.0247 (6.5969)  time: 0.7769  data: 0.0358  max mem: 29528
Epoch: [0]  [ 180/3696]  eta: 0:49:27  lr: 0.000100  loss: 5.9822 (6.5631)  at: 5.9822 (6.5631)  at_unscaled: 5.9822 (6.5631)  time: 0.8812  data: 0.0361  max mem: 29528
Epoch: [0]  [ 190/3696]  eta: 0:49:06  lr: 0.000100  loss: 5.9351 (6.5278)  at: 5.9351 (6.5278)  at_unscaled: 5.9351 (6.5278)  time: 0.8712  data: 0.0371  max mem: 29528
Epoch: [0]  [ 200/3696]  eta: 0:48:45  lr: 0.000100  loss: 5.8904 (6.4953)  at: 5.8904 (6.4953)  at_unscaled: 5.8904 (6.4953)  time: 0.7744  data: 0.0355  max mem: 29528
Epoch: [0]  [ 210/3696]  eta: 0:48:35  lr: 0.000100  loss: 5.8645 (6.4635)  at: 5.8645 (6.4635)  at_unscaled: 5.8645 (6.4635)  time: 0.7968  data: 0.0348  max mem: 29528
Epoch: [0]  [ 220/3696]  eta: 0:48:17  lr: 0.000100  loss: 5.8032 (6.4343)  at: 5.8032 (6.4343)  at_unscaled: 5.8032 (6.4343)  time: 0.7998  data: 0.0354  max mem: 29528
Epoch: [0]  [ 230/3696]  eta: 0:47:58  lr: 0.000100  loss: 5.7949 (6.4067)  at: 5.7949 (6.4067)  at_unscaled: 5.7949 (6.4067)  time: 0.7687  data: 0.0362  max mem: 29528
Epoch: [0]  [ 240/3696]  eta: 0:47:45  lr: 0.000100  loss: 5.7568 (6.3776)  at: 5.7568 (6.3776)  at_unscaled: 5.7568 (6.3776)  time: 0.7808  data: 0.0371  max mem: 29528
Epoch: [0]  [ 250/3696]  eta: 0:47:30  lr: 0.000100  loss: 5.7063 (6.3502)  at: 5.7063 (6.3502)  at_unscaled: 5.7063 (6.3502)  time: 0.7889  data: 0.0366  max mem: 29528
Epoch: [0]  [ 260/3696]  eta: 0:47:11  lr: 0.000100  loss: 5.6821 (6.3225)  at: 5.6821 (6.3225)  at_unscaled: 5.6821 (6.3225)  time: 0.7617  data: 0.0362  max mem: 29528
Epoch: [0]  [ 270/3696]  eta: 0:47:00  lr: 0.000100  loss: 5.6091 (6.2965)  at: 5.6091 (6.2965)  at_unscaled: 5.6091 (6.2965)  time: 0.7725  data: 0.0366  max mem: 29528
Epoch: [0]  [ 280/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.6024 (6.2713)  at: 5.6024 (6.2713)  at_unscaled: 5.6024 (6.2713)  time: 0.7982  data: 0.0366  max mem: 29528
Epoch: [0]  [ 290/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.5578 (6.2455)  at: 5.5578 (6.2455)  at_unscaled: 5.5578 (6.2455)  time: 0.8433  data: 0.0370  max mem: 29528
Epoch: [0]  [ 300/3696]  eta: 0:46:36  lr: 0.000100  loss: 5.5396 (6.2221)  at: 5.5396 (6.2221)  at_unscaled: 5.5396 (6.2221)  time: 0.8398  data: 0.0373  max mem: 29528
Epoch: [0]  [ 310/3696]  eta: 0:46:23  lr: 0.000100  loss: 5.5059 (6.1994)  at: 5.5059 (6.1994)  at_unscaled: 5.5059 (6.1994)  time: 0.7842  data: 0.0374  max mem: 29528
Epoch: [0]  [ 320/3696]  eta: 0:46:12  lr: 0.000100  loss: 5.4888 (6.1767)  at: 5.4888 (6.1767)  at_unscaled: 5.4888 (6.1767)  time: 0.7882  data: 0.0370  max mem: 29528
Epoch: [0]  [ 330/3696]  eta: 0:45:58  lr: 0.000100  loss: 5.4756 (6.1560)  at: 5.4756 (6.1560)  at_unscaled: 5.4756 (6.1560)  time: 0.7820  data: 0.0365  max mem: 29528
Epoch: [0]  [ 340/3696]  eta: 0:45:49  lr: 0.000100  loss: 5.4458 (6.1354)  at: 5.4458 (6.1354)  at_unscaled: 5.4458 (6.1354)  time: 0.7886  data: 0.0363  max mem: 29528
Epoch: [0]  [ 350/3696]  eta: 0:45:42  lr: 0.000100  loss: 5.4504 (6.1157)  at: 5.4504 (6.1157)  at_unscaled: 5.4504 (6.1157)  time: 0.8230  data: 0.0364  max mem: 29528
Epoch: [0]  [ 360/3696]  eta: 0:45:34  lr: 0.000100  loss: 5.4683 (6.0973)  at: 5.4683 (6.0973)  at_unscaled: 5.4683 (6.0973)  time: 0.8292  data: 0.0370  max mem: 29528
Epoch: [0]  [ 370/3696]  eta: 0:45:30  lr: 0.000100  loss: 5.4665 (6.0802)  at: 5.4665 (6.0802)  at_unscaled: 5.4665 (6.0802)  time: 0.8410  data: 0.0357  max mem: 29528
Epoch: [0]  [ 380/3696]  eta: 0:45:22  lr: 0.000100  loss: 5.4943 (6.0647)  at: 5.4943 (6.0647)  at_unscaled: 5.4943 (6.0647)  time: 0.8443  data: 0.0360  max mem: 29528
Epoch: [0]  [ 390/3696]  eta: 0:45:13  lr: 0.000100  loss: 5.4801 (6.0489)  at: 5.4801 (6.0489)  at_unscaled: 5.4801 (6.0489)  time: 0.8209  data: 0.0371  max mem: 29528
Epoch: [0]  [ 400/3696]  eta: 0:45:14  lr: 0.000100  loss: 5.4442 (6.0338)  at: 5.4442 (6.0338)  at_unscaled: 5.4442 (6.0338)  time: 0.8706  data: 0.0372  max mem: 29528
Epoch: [0]  [ 410/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.4351 (6.0182)  at: 5.4351 (6.0182)  at_unscaled: 5.4351 (6.0182)  time: 0.8613  data: 0.0376  max mem: 29528
Epoch: [0]  [ 420/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.3845 (6.0028)  at: 5.3845 (6.0028)  at_unscaled: 5.3845 (6.0028)  time: 0.7759  data: 0.0373  max mem: 29528
Epoch: [0]  [ 430/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.3922 (5.9884)  at: 5.3922 (5.9884)  at_unscaled: 5.3922 (5.9884)  time: 0.9318  data: 0.0361  max mem: 29528
Epoch: [0]  [ 440/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.4115 (5.9759)  at: 5.4115 (5.9759)  at_unscaled: 5.4115 (5.9759)  time: 0.9331  data: 0.0361  max mem: 29528
Epoch: [0]  [ 450/3696]  eta: 0:44:43  lr: 0.000100  loss: 5.4180 (5.9631)  at: 5.4180 (5.9631)  at_unscaled: 5.4180 (5.9631)  time: 0.8017  data: 0.0359  max mem: 29528
Epoch: [0]  [ 460/3696]  eta: 0:44:29  lr: 0.000100  loss: 5.3881 (5.9501)  at: 5.3881 (5.9501)  at_unscaled: 5.3881 (5.9501)  time: 0.7948  data: 0.0355  max mem: 29528
Epoch: [0]  [ 470/3696]  eta: 0:44:18  lr: 0.000100  loss: 5.3906 (5.9391)  at: 5.3906 (5.9391)  at_unscaled: 5.3906 (5.9391)  time: 0.7668  data: 0.0371  max mem: 29528
Epoch: [0]  [ 480/3696]  eta: 0:44:10  lr: 0.000100  loss: 5.3906 (5.9277)  at: 5.3906 (5.9277)  at_unscaled: 5.3906 (5.9277)  time: 0.8013  data: 0.0390  max mem: 29528
Epoch: [0]  [ 490/3696]  eta: 0:44:03  lr: 0.000100  loss: 5.4143 (5.9179)  at: 5.4143 (5.9179)  at_unscaled: 5.4143 (5.9179)  time: 0.8300  data: 0.0391  max mem: 29528
Epoch: [0]  [ 500/3696]  eta: 0:43:54  lr: 0.000100  loss: 5.4093 (5.9075)  at: 5.4093 (5.9075)  at_unscaled: 5.4093 (5.9075)  time: 0.8303  data: 0.0378  max mem: 29528
Epoch: [0]  [ 510/3696]  eta: 0:43:43  lr: 0.000100  loss: 5.3890 (5.8972)  at: 5.3890 (5.8972)  at_unscaled: 5.3890 (5.8972)  time: 0.7958  data: 0.0367  max mem: 29528
Epoch: [0]  [ 520/3696]  eta: 0:43:31  lr: 0.000100  loss: 5.3959 (5.8872)  at: 5.3959 (5.8872)  at_unscaled: 5.3959 (5.8872)  time: 0.7730  data: 0.0355  max mem: 29528
Epoch: [0]  [ 530/3696]  eta: 0:43:22  lr: 0.000100  loss: 5.3743 (5.8775)  at: 5.3743 (5.8775)  at_unscaled: 5.3743 (5.8775)  time: 0.7915  data: 0.0358  max mem: 29528
Epoch: [0]  [ 540/3696]  eta: 0:43:12  lr: 0.000100  loss: 5.3725 (5.8675)  at: 5.3725 (5.8675)  at_unscaled: 5.3725 (5.8675)  time: 0.8013  data: 0.0355  max mem: 29528
Epoch: [0]  [ 550/3696]  eta: 0:43:02  lr: 0.000100  loss: 5.3403 (5.8580)  at: 5.3403 (5.8580)  at_unscaled: 5.3403 (5.8580)  time: 0.7922  data: 0.0349  max mem: 29528
Epoch: [0]  [ 560/3696]  eta: 0:42:52  lr: 0.000100  loss: 5.3460 (5.8494)  at: 5.3460 (5.8494)  at_unscaled: 5.3460 (5.8494)  time: 0.7893  data: 0.0355  max mem: 29528
Epoch: [0]  [ 570/3696]  eta: 0:42:43  lr: 0.000100  loss: 5.3509 (5.8408)  at: 5.3509 (5.8408)  at_unscaled: 5.3509 (5.8408)  time: 0.7901  data: 0.0359  max mem: 29528
Epoch: [0]  [ 580/3696]  eta: 0:42:31  lr: 0.000100  loss: 5.3509 (5.8328)  at: 5.3509 (5.8328)  at_unscaled: 5.3509 (5.8328)  time: 0.7762  data: 0.0358  max mem: 29528
Epoch: [0]  [ 590/3696]  eta: 0:42:22  lr: 0.000100  loss: 5.3572 (5.8243)  at: 5.3572 (5.8243)  at_unscaled: 5.3572 (5.8243)  time: 0.7785  data: 0.0351  max mem: 29528
Epoch: [0]  [ 600/3696]  eta: 0:42:11  lr: 0.000100  loss: 5.3541 (5.8163)  at: 5.3541 (5.8163)  at_unscaled: 5.3541 (5.8163)  time: 0.7857  data: 0.0343  max mem: 29528
Epoch: [0]  [ 610/3696]  eta: 0:41:59  lr: 0.000100  loss: 5.3445 (5.8085)  at: 5.3445 (5.8085)  at_unscaled: 5.3445 (5.8085)  time: 0.7585  data: 0.0351  max mem: 29528
Epoch: [0]  [ 620/3696]  eta: 0:41:54  lr: 0.000100  loss: 5.3499 (5.8015)  at: 5.3499 (5.8015)  at_unscaled: 5.3499 (5.8015)  time: 0.8055  data: 0.0354  max mem: 29528
Epoch: [0]  [ 630/3696]  eta: 0:41:42  lr: 0.000100  loss: 5.3499 (5.7940)  at: 5.3499 (5.7940)  at_unscaled: 5.3499 (5.7940)  time: 0.8031  data: 0.0343  max mem: 29528
Epoch: [0]  [ 640/3696]  eta: 0:41:31  lr: 0.000100  loss: 5.3273 (5.7865)  at: 5.3273 (5.7865)  at_unscaled: 5.3273 (5.7865)  time: 0.7553  data: 0.0356  max mem: 29528
Epoch: [0]  [ 650/3696]  eta: 0:41:22  lr: 0.000100  loss: 5.3314 (5.7792)  at: 5.3314 (5.7792)  at_unscaled: 5.3314 (5.7792)  time: 0.7825  data: 0.0378  max mem: 29528
Epoch: [0]  [ 660/3696]  eta: 0:41:16  lr: 0.000100  loss: 5.3259 (5.7719)  at: 5.3259 (5.7719)  at_unscaled: 5.3259 (5.7719)  time: 0.8199  data: 0.0371  max mem: 29528
Epoch: [0]  [ 670/3696]  eta: 0:41:06  lr: 0.000100  loss: 5.2930 (5.7651)  at: 5.2930 (5.7651)  at_unscaled: 5.2930 (5.7651)  time: 0.8170  data: 0.0351  max mem: 29528
Epoch: [0]  [ 680/3696]  eta: 0:40:57  lr: 0.000100  loss: 5.2930 (5.7582)  at: 5.2930 (5.7582)  at_unscaled: 5.2930 (5.7582)  time: 0.7851  data: 0.0354  max mem: 29528
Epoch: [0]  [ 690/3696]  eta: 0:40:49  lr: 0.000100  loss: 5.2727 (5.7514)  at: 5.2727 (5.7514)  at_unscaled: 5.2727 (5.7514)  time: 0.8068  data: 0.0353  max mem: 29528
Epoch: [0]  [ 700/3696]  eta: 0:40:41  lr: 0.000100  loss: 5.2917 (5.7451)  at: 5.2917 (5.7451)  at_unscaled: 5.2917 (5.7451)  time: 0.8184  data: 0.0348  max mem: 29528
Epoch: [0]  [ 710/3696]  eta: 0:40:31  lr: 0.000100  loss: 5.2949 (5.7387)  at: 5.2949 (5.7387)  at_unscaled: 5.2949 (5.7387)  time: 0.7904  data: 0.0358  max mem: 29528
Epoch: [0]  [ 720/3696]  eta: 0:40:21  lr: 0.000100  loss: 5.2874 (5.7325)  at: 5.2874 (5.7325)  at_unscaled: 5.2874 (5.7325)  time: 0.7719  data: 0.0376  max mem: 29528
Epoch: [0]  [ 730/3696]  eta: 0:40:10  lr: 0.000100  loss: 5.2801 (5.7262)  at: 5.2801 (5.7262)  at_unscaled: 5.2801 (5.7262)  time: 0.7581  data: 0.0372  max mem: 29528
Epoch: [0]  [ 740/3696]  eta: 0:40:02  lr: 0.000100  loss: 5.2634 (5.7196)  at: 5.2634 (5.7196)  at_unscaled: 5.2634 (5.7196)  time: 0.7769  data: 0.0357  max mem: 29528
Epoch: [0]  [ 750/3696]  eta: 0:39:53  lr: 0.000100  loss: 5.2367 (5.7135)  at: 5.2367 (5.7135)  at_unscaled: 5.2367 (5.7135)  time: 0.8039  data: 0.0365  max mem: 29528
Epoch: [0]  [ 760/3696]  eta: 0:39:43  lr: 0.000100  loss: 5.2874 (5.7082)  at: 5.2874 (5.7082)  at_unscaled: 5.2874 (5.7082)  time: 0.7800  data: 0.0367  max mem: 29528
Epoch: [0]  [ 770/3696]  eta: 0:39:33  lr: 0.000100  loss: 5.2954 (5.7024)  at: 5.2954 (5.7024)  at_unscaled: 5.2954 (5.7024)  time: 0.7681  data: 0.0356  max mem: 29528
Epoch: [0]  [ 780/3696]  eta: 0:39:23  lr: 0.000100  loss: 5.3127 (5.6975)  at: 5.3127 (5.6975)  at_unscaled: 5.3127 (5.6975)  time: 0.7632  data: 0.0361  max mem: 29528
Epoch: [0]  [ 790/3696]  eta: 0:39:14  lr: 0.000100  loss: 5.3130 (5.6919)  at: 5.3130 (5.6919)  at_unscaled: 5.3130 (5.6919)  time: 0.7715  data: 0.0359  max mem: 29528
Epoch: [0]  [ 800/3696]  eta: 0:39:06  lr: 0.000100  loss: 5.2498 (5.6860)  at: 5.2498 (5.6860)  at_unscaled: 5.2498 (5.6860)  time: 0.7954  data: 0.0369  max mem: 29528
Epoch: [0]  [ 810/3696]  eta: 0:38:58  lr: 0.000100  loss: 5.2336 (5.6804)  at: 5.2336 (5.6804)  at_unscaled: 5.2336 (5.6804)  time: 0.8095  data: 0.0380  max mem: 29528
Epoch: [0]  [ 820/3696]  eta: 0:38:50  lr: 0.000100  loss: 5.2354 (5.6755)  at: 5.2354 (5.6755)  at_unscaled: 5.2354 (5.6755)  time: 0.8130  data: 0.0356  max mem: 29528
Epoch: [0]  [ 830/3696]  eta: 0:38:39  lr: 0.000100  loss: 5.2691 (5.6704)  at: 5.2691 (5.6704)  at_unscaled: 5.2691 (5.6704)  time: 0.7757  data: 0.0355  max mem: 29528
Epoch: [0]  [ 840/3696]  eta: 0:38:31  lr: 0.000100  loss: 5.2588 (5.6653)  at: 5.2588 (5.6653)  at_unscaled: 5.2588 (5.6653)  time: 0.7692  data: 0.0369  max mem: 29528
Epoch: [0]  [ 850/3696]  eta: 0:38:23  lr: 0.000100  loss: 5.2564 (5.6606)  at: 5.2564 (5.6606)  at_unscaled: 5.2564 (5.6606)  time: 0.8133  data: 0.0363  max mem: 29528
Epoch: [0]  [ 860/3696]  eta: 0:38:15  lr: 0.000100  loss: 5.2448 (5.6556)  at: 5.2448 (5.6556)  at_unscaled: 5.2448 (5.6556)  time: 0.8129  data: 0.0352  max mem: 29528
Epoch: [0]  [ 870/3696]  eta: 0:38:05  lr: 0.000100  loss: 5.2326 (5.6506)  at: 5.2326 (5.6506)  at_unscaled: 5.2326 (5.6506)  time: 0.7795  data: 0.0351  max mem: 29528
Epoch: [0]  [ 880/3696]  eta: 0:37:56  lr: 0.000100  loss: 5.2049 (5.6456)  at: 5.2049 (5.6456)  at_unscaled: 5.2049 (5.6456)  time: 0.7750  data: 0.0364  max mem: 29528
Epoch: [0]  [ 890/3696]  eta: 0:37:47  lr: 0.000100  loss: 5.2049 (5.6407)  at: 5.2049 (5.6407)  at_unscaled: 5.2049 (5.6407)  time: 0.7812  data: 0.0367  max mem: 29528
Epoch: [0]  [ 900/3696]  eta: 0:37:37  lr: 0.000100  loss: 5.1690 (5.6354)  at: 5.1690 (5.6354)  at_unscaled: 5.1690 (5.6354)  time: 0.7607  data: 0.0348  max mem: 29528
Epoch: [0]  [ 910/3696]  eta: 0:37:31  lr: 0.000100  loss: 5.1836 (5.6309)  at: 5.1836 (5.6309)  at_unscaled: 5.1836 (5.6309)  time: 0.8035  data: 0.0355  max mem: 29528
Epoch: [0]  [ 920/3696]  eta: 0:37:22  lr: 0.000100  loss: 5.2129 (5.6261)  at: 5.2129 (5.6261)  at_unscaled: 5.2129 (5.6261)  time: 0.8221  data: 0.0381  max mem: 29528
Epoch: [0]  [ 930/3696]  eta: 0:37:13  lr: 0.000100  loss: 5.1586 (5.6210)  at: 5.1586 (5.6210)  at_unscaled: 5.1586 (5.6210)  time: 0.7758  data: 0.0377  max mem: 29528
Epoch: [0]  [ 940/3696]  eta: 0:37:05  lr: 0.000100  loss: 5.1586 (5.6162)  at: 5.1586 (5.6162)  at_unscaled: 5.1586 (5.6162)  time: 0.7975  data: 0.0355  max mem: 29528
Epoch: [0]  [ 950/3696]  eta: 0:36:56  lr: 0.000100  loss: 5.1713 (5.6120)  at: 5.1713 (5.6120)  at_unscaled: 5.1713 (5.6120)  time: 0.7970  data: 0.0358  max mem: 29528
Epoch: [0]  [ 960/3696]  eta: 0:36:47  lr: 0.000100  loss: 5.1839 (5.6077)  at: 5.1839 (5.6077)  at_unscaled: 5.1839 (5.6077)  time: 0.7714  data: 0.0367  max mem: 29528
Epoch: [0]  [ 970/3696]  eta: 0:36:38  lr: 0.000100  loss: 5.1800 (5.6036)  at: 5.1800 (5.6036)  at_unscaled: 5.1800 (5.6036)  time: 0.7812  data: 0.0363  max mem: 29528
Epoch: [0]  [ 980/3696]  eta: 0:36:30  lr: 0.000100  loss: 5.2028 (5.5995)  at: 5.2028 (5.5995)  at_unscaled: 5.2028 (5.5995)  time: 0.7996  data: 0.0349  max mem: 29528
Epoch: [0]  [ 990/3696]  eta: 0:36:23  lr: 0.000100  loss: 5.2028 (5.5954)  at: 5.2028 (5.5954)  at_unscaled: 5.2028 (5.5954)  time: 0.8110  data: 0.0353  max mem: 29528
Epoch: [0]  [1000/3696]  eta: 0:36:14  lr: 0.000100  loss: 5.1880 (5.5914)  at: 5.1880 (5.5914)  at_unscaled: 5.1880 (5.5914)  time: 0.7950  data: 0.0369  max mem: 29528
Epoch: [0]  [1010/3696]  eta: 0:36:04  lr: 0.000100  loss: 5.1773 (5.5870)  at: 5.1773 (5.5870)  at_unscaled: 5.1773 (5.5870)  time: 0.7645  data: 0.0368  max mem: 29528
Epoch: [0]  [1020/3696]  eta: 0:35:57  lr: 0.000100  loss: 5.2493 (5.5836)  at: 5.2493 (5.5836)  at_unscaled: 5.2493 (5.5836)  time: 0.7915  data: 0.0360  max mem: 29528
Epoch: [0]  [1030/3696]  eta: 0:35:49  lr: 0.000100  loss: 5.1982 (5.5793)  at: 5.1982 (5.5793)  at_unscaled: 5.1982 (5.5793)  time: 0.8164  data: 0.0363  max mem: 29528
Epoch: [0]  [1040/3696]  eta: 0:35:41  lr: 0.000100  loss: 5.1446 (5.5754)  at: 5.1446 (5.5754)  at_unscaled: 5.1446 (5.5754)  time: 0.8053  data: 0.0375  max mem: 29528
Epoch: [0]  [1050/3696]  eta: 0:35:31  lr: 0.000100  loss: 5.1319 (5.5714)  at: 5.1319 (5.5714)  at_unscaled: 5.1319 (5.5714)  time: 0.7766  data: 0.0359  max mem: 29528
Epoch: [0]  [1060/3696]  eta: 0:35:22  lr: 0.000100  loss: 5.2017 (5.5679)  at: 5.2017 (5.5679)  at_unscaled: 5.2017 (5.5679)  time: 0.7481  data: 0.0365  max mem: 29528
Epoch: [0]  [1070/3696]  eta: 0:35:13  lr: 0.000100  loss: 5.2017 (5.5642)  at: 5.2017 (5.5642)  at_unscaled: 5.2017 (5.5642)  time: 0.7754  data: 0.0387  max mem: 29528
Epoch: [0]  [1080/3696]  eta: 0:35:03  lr: 0.000100  loss: 5.1192 (5.5603)  at: 5.1192 (5.5603)  at_unscaled: 5.1192 (5.5603)  time: 0.7605  data: 0.0383  max mem: 29528
Epoch: [0]  [1090/3696]  eta: 0:34:56  lr: 0.000100  loss: 5.1105 (5.5560)  at: 5.1105 (5.5560)  at_unscaled: 5.1105 (5.5560)  time: 0.7700  data: 0.0379  max mem: 29528
Epoch: [0]  [1100/3696]  eta: 0:34:47  lr: 0.000100  loss: 5.1321 (5.5524)  at: 5.1321 (5.5524)  at_unscaled: 5.1321 (5.5524)  time: 0.8007  data: 0.0380  max mem: 29528
Epoch: [0]  [1110/3696]  eta: 0:34:39  lr: 0.000100  loss: 5.1603 (5.5489)  at: 5.1603 (5.5489)  at_unscaled: 5.1603 (5.5489)  time: 0.7850  data: 0.0382  max mem: 29528
Epoch: [0]  [1120/3696]  eta: 0:34:30  lr: 0.000100  loss: 5.1443 (5.5452)  at: 5.1443 (5.5452)  at_unscaled: 5.1443 (5.5452)  time: 0.7765  data: 0.0383  max mem: 29528
Epoch: [0]  [1130/3696]  eta: 0:34:21  lr: 0.000100  loss: 5.1185 (5.5413)  at: 5.1185 (5.5413)  at_unscaled: 5.1185 (5.5413)  time: 0.7790  data: 0.0372  max mem: 29528
Epoch: [0]  [1140/3696]  eta: 0:34:13  lr: 0.000100  loss: 5.0800 (5.5374)  at: 5.0800 (5.5374)  at_unscaled: 5.0800 (5.5374)  time: 0.7986  data: 0.0356  max mem: 29528
Epoch: [0]  [1150/3696]  eta: 0:34:04  lr: 0.000100  loss: 5.1101 (5.5337)  at: 5.1101 (5.5337)  at_unscaled: 5.1101 (5.5337)  time: 0.7654  data: 0.0345  max mem: 29528
Epoch: [0]  [1160/3696]  eta: 0:33:56  lr: 0.000100  loss: 5.1744 (5.5307)  at: 5.1744 (5.5307)  at_unscaled: 5.1744 (5.5307)  time: 0.7695  data: 0.0344  max mem: 29528
Epoch: [0]  [1170/3696]  eta: 0:33:47  lr: 0.000100  loss: 5.1829 (5.5277)  at: 5.1829 (5.5277)  at_unscaled: 5.1829 (5.5277)  time: 0.7968  data: 0.0362  max mem: 29528
Epoch: [0]  [1180/3696]  eta: 0:33:40  lr: 0.000100  loss: 5.1845 (5.5246)  at: 5.1845 (5.5246)  at_unscaled: 5.1845 (5.5246)  time: 0.8120  data: 0.0374  max mem: 29528
Epoch: [0]  [1190/3696]  eta: 0:33:32  lr: 0.000100  loss: 5.1798 (5.5216)  at: 5.1798 (5.5216)  at_unscaled: 5.1798 (5.5216)  time: 0.8169  data: 0.0371  max mem: 29528
Epoch: [0]  [1200/3696]  eta: 0:33:23  lr: 0.000100  loss: 5.1929 (5.5188)  at: 5.1929 (5.5188)  at_unscaled: 5.1929 (5.5188)  time: 0.7739  data: 0.0361  max mem: 29528
Epoch: [0]  [1210/3696]  eta: 0:33:16  lr: 0.000100  loss: 5.1929 (5.5158)  at: 5.1929 (5.5158)  at_unscaled: 5.1929 (5.5158)  time: 0.7985  data: 0.0340  max mem: 29528
Epoch: [0]  [1220/3696]  eta: 0:33:07  lr: 0.000100  loss: 5.1322 (5.5126)  at: 5.1322 (5.5126)  at_unscaled: 5.1322 (5.5126)  time: 0.8027  data: 0.0350  max mem: 29528
Epoch: [0]  [1230/3696]  eta: 0:32:59  lr: 0.000100  loss: 5.1595 (5.5096)  at: 5.1595 (5.5096)  at_unscaled: 5.1595 (5.5096)  time: 0.7881  data: 0.0374  max mem: 29528
Epoch: [0]  [1240/3696]  eta: 0:32:50  lr: 0.000100  loss: 5.1620 (5.5067)  at: 5.1620 (5.5067)  at_unscaled: 5.1620 (5.5067)  time: 0.7849  data: 0.0365  max mem: 29528
Epoch: [0]  [1250/3696]  eta: 0:32:42  lr: 0.000100  loss: 5.1620 (5.5038)  at: 5.1620 (5.5038)  at_unscaled: 5.1620 (5.5038)  time: 0.7893  data: 0.0357  max mem: 29528
Epoch: [0]  [1260/3696]  eta: 0:32:34  lr: 0.000100  loss: 5.1245 (5.5005)  at: 5.1245 (5.5005)  at_unscaled: 5.1245 (5.5005)  time: 0.8002  data: 0.0359  max mem: 29528
Epoch: [0]  [1270/3696]  eta: 0:32:26  lr: 0.000100  loss: 5.1023 (5.4975)  at: 5.1023 (5.4975)  at_unscaled: 5.1023 (5.4975)  time: 0.8015  data: 0.0362  max mem: 29528
Epoch: [0]  [1280/3696]  eta: 0:32:17  lr: 0.000100  loss: 5.1132 (5.4946)  at: 5.1132 (5.4946)  at_unscaled: 5.1132 (5.4946)  time: 0.7906  data: 0.0349  max mem: 29528
Epoch: [0]  [1290/3696]  eta: 0:32:09  lr: 0.000100  loss: 5.1292 (5.4918)  at: 5.1292 (5.4918)  at_unscaled: 5.1292 (5.4918)  time: 0.7743  data: 0.0334  max mem: 29528
Epoch: [0]  [1300/3696]  eta: 0:32:01  lr: 0.000100  loss: 5.1292 (5.4890)  at: 5.1292 (5.4890)  at_unscaled: 5.1292 (5.4890)  time: 0.7875  data: 0.0339  max mem: 29528
Epoch: [0]  [1310/3696]  eta: 0:31:54  lr: 0.000100  loss: 5.1232 (5.4863)  at: 5.1232 (5.4863)  at_unscaled: 5.1232 (5.4863)  time: 0.8117  data: 0.0343  max mem: 29528
Epoch: [0]  [1320/3696]  eta: 0:31:45  lr: 0.000100  loss: 5.1016 (5.4832)  at: 5.1016 (5.4832)  at_unscaled: 5.1016 (5.4832)  time: 0.8161  data: 0.0341  max mem: 29528
Epoch: [0]  [1330/3696]  eta: 0:31:38  lr: 0.000100  loss: 5.0905 (5.4805)  at: 5.0905 (5.4805)  at_unscaled: 5.0905 (5.4805)  time: 0.8149  data: 0.0343  max mem: 29528
Traceback (most recent call last):
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 207, in main
    args.clip_max_norm, learning_rate_schedule)
  File "/opt/tiger/intro/Stable-Pix2Seq/engine.py", line 98, in train_one_epoch
    losses.backward()
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; 213.75 MiB free; 29.95 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', './coco2017/', '--batch_size', '4', '--lr', '0.0005', '--output_dir', './output']' returned non-zero exit status 1.
Killing subprocess 5627
Killing subprocess 5628
Killing subprocess 5629
Killing subprocess 5630
Killing subprocess 5631
Killing subprocess 5632
Killing subprocess 5633

allanj, Sep 19 '22 06:09
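
One way to read the log above is to look only at the points where the reported `max mem` value rises. The sketch below does that for a few abridged lines taken from the log. The helper name `max_mem_jumps` and the regular expression are illustrative and not part of the repository, and the code assumes `max mem` is reported in MiB, which the log itself does not state. Note also that the progress lines come from rank 0 while the error was raised on GPU 7, so the peak on the failing device may differ from what is printed.

```python
import re

# Matches the iteration counter, e.g. "[ 140/3696]", and the "max mem: N" field.
LOG_LINE = re.compile(r"\[\s*(\d+)/\d+\].*max mem: (\d+)")


def max_mem_jumps(log_lines):
    """Return (iteration, max_mem) pairs for each line where the peak increased."""
    jumps, last = [], None
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a training-progress line
        step, mem = int(match.group(1)), int(match.group(2))
        if last is None or mem > last:
            jumps.append((step, mem))
            last = mem
    return jumps


# Abridged lines from the log above (loss and timing fields dropped for brevity).
sample = [
    "Epoch: [0]  [   0/3696]  eta: 2:32:25  lr: 0.000100  max mem: 14737",
    "Epoch: [0]  [  10/3696]  eta: 0:59:14  lr: 0.000100  max mem: 25656",
    "Epoch: [0]  [  50/3696]  eta: 0:53:30  lr: 0.000100  max mem: 25656",
    "Epoch: [0]  [  60/3696]  eta: 0:53:44  lr: 0.000100  max mem: 26623",
    "Epoch: [0]  [ 140/3696]  eta: 0:50:25  lr: 0.000100  max mem: 29528",
    "Epoch: [0]  [1330/3696]  eta: 0:31:38  lr: 0.000100  max mem: 29528",
]

jumps = max_mem_jumps(sample)
print(jumps)

# Assumption: "max mem" is in MiB. The error message reports 31.75 GiB per GPU.
capacity_mib = 31.75 * 1024
headroom_mib = capacity_mib - jumps[-1][1]
print(headroom_mib)
```

On these lines the peak rises in the windows ending at iterations 0, 10, 60 and 140, then stays flat until the crash. Under the MiB assumption that leaves roughly 3 GiB of headroom on rank 0, so a single larger-than-usual batch on any one GPU could plausibly exhaust it. The log alone does not confirm that this is what happened.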

Changing the batch size from 4 to 3 works for me, though. 😞

allanj, Sep 19 '22 07:09
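
Lowering the per-GPU batch size also changes the global batch size, which may matter when comparing results against the original settings. The sketch below works through that arithmetic using `world_size=8` and `batch_size=4` from the log. The function names are hypothetical. The gradient-accumulation variant is only an illustration of how the original global batch could be kept. Nothing in this thread shows that the training script supports it, so it would likely require changing the training loop.

```python
import math


def global_batch_size(world_size, per_gpu_batch, accum_steps=1):
    """Samples contributing to one optimizer step across all processes."""
    return world_size * per_gpu_batch * accum_steps


def accum_steps_for(target_global, world_size, per_gpu_batch):
    """Smallest accumulation count whose global batch is at least the target."""
    return math.ceil(target_global / (world_size * per_gpu_batch))


original = global_batch_size(world_size=8, per_gpu_batch=4)  # settings from the log
reduced = global_batch_size(world_size=8, per_gpu_batch=3)   # the workaround
print(original, reduced)

# Hypothetical alternative: 2 samples per GPU, accumulating gradients
# over several steps before each optimizer update.
steps = accum_steps_for(original, world_size=8, per_gpu_batch=2)
print(steps, global_batch_size(8, 2, steps))
```

With 8 processes, the workaround moves the global batch from 32 to 24 samples per step. Two samples per GPU with two accumulation steps would return to 32 at lower peak memory, at the cost of more forward and backward passes per update.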