Logs for the fine tuning on LVIS detection

Open Flaick opened this issue 3 years ago • 6 comments

Hello, I am wondering if there is a log file available for the fine-tuning on 1% LVIS few-shot detection.

Flaick avatar Oct 26 '21 05:10 Flaick

Also, I would be grateful if you could provide the hyperparameter settings for the 1% experiment.

Flaick avatar Oct 26 '21 05:10 Flaick

@Flaick, have you managed to run the fine-tuning?

I have a strange error. When I run python main.py --dataset_config configs/lvis.json --load pretrained_resnet101_checkpoint.pth --ema --epochs 150 --lr_drop 120 --eval_skip 5 on GPU, I get:

Epoch: [0]  [    0/73902]  eta: 1 day, 13:20:57  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 14.1489 (14.1489)  loss_ce: 2.3089 (2.3089)  loss_bbox: 0.0000 (0.0000)  loss_giou: 0.0000 (0.0000)  loss_contrastive_align: 0.0000 (0.0000)  loss_ce_0: 2.2728 (2.2728)  loss_bbox_0: 0.0000 (0.0000)  loss_giou_0: 0.0000 (0.0000)  loss_contrastive_align_0: 0.0000 (0.0000)  loss_ce_1: 2.1969 (2.1969)  loss_bbox_1: 0.0000 (0.0000)  loss_giou_1: 0.0000 (0.0000)  loss_contrastive_align_1: 0.0000 (0.0000)  loss_ce_2: 2.4855 (2.4855)  loss_bbox_2: 0.0000 (0.0000)  loss_giou_2: 0.0000 (0.0000)  loss_contrastive_align_2: 0.0000 (0.0000)  loss_ce_3: 2.5023 (2.5023)  loss_bbox_3: 0.0000 (0.0000)  loss_giou_3: 0.0000 (0.0000)  loss_contrastive_align_3: 0.0000 (0.0000)  loss_ce_4: 2.3826 (2.3826)  loss_bbox_4: 0.0000 (0.0000)  loss_giou_4: 0.0000 (0.0000)  loss_contrastive_align_4: 0.0000 (0.0000)  loss_ce_unscaled: 2.3089 (2.3089)  loss_bbox_unscaled: 0.0000 (0.0000)  loss_giou_unscaled: 0.0000 (0.0000)  cardinality_error_unscaled: 2.0000 (2.0000)  loss_contrastive_align_unscaled: 0.0000 (0.0000)  loss_ce_0_unscaled: 2.2728 (2.2728)  loss_bbox_0_unscaled: 0.0000 (0.0000)  loss_giou_0_unscaled: 0.0000 (0.0000)  cardinality_error_0_unscaled: 3.0000 (3.0000)  loss_contrastive_align_0_unscaled: 0.0000 (0.0000)  loss_ce_1_unscaled: 2.1969 (2.1969)  loss_bbox_1_unscaled: 0.0000 (0.0000)  loss_giou_1_unscaled: 0.0000 (0.0000)  cardinality_error_1_unscaled: 3.0000 (3.0000)  loss_contrastive_align_1_unscaled: 0.0000 (0.0000)  loss_ce_2_unscaled: 2.4855 (2.4855)  loss_bbox_2_unscaled: 0.0000 (0.0000)  loss_giou_2_unscaled: 0.0000 (0.0000)  cardinality_error_2_unscaled: 2.0000 (2.0000)  loss_contrastive_align_2_unscaled: 0.0000 (0.0000)  loss_ce_3_unscaled: 2.5023 (2.5023)  loss_bbox_3_unscaled: 0.0000 (0.0000)  loss_giou_3_unscaled: 0.0000 (0.0000)  cardinality_error_3_unscaled: 2.0000 (2.0000)  loss_contrastive_align_3_unscaled: 0.0000 (0.0000)  loss_ce_4_unscaled: 
2.3826 (2.3826)  loss_bbox_4_unscaled: 0.0000 (0.0000)  loss_giou_4_unscaled: 0.0000 (0.0000)  cardinality_error_4_unscaled: 2.0000 (2.0000)  loss_contrastive_align_4_unscaled: 0.0000 (0.0000)  time: 1.8194  data: 1.2582  max mem: 4014
Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR_untouched/mdetr/engine.py", line 73, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/pchelintsev/.conda/envs/mdetr_env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 679, in forward
    losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 655, in get_loss
    return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 487, in loss_labels
    eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor61

So, as was suggested in another issue, I ran it on the CPU and it worked!

Starting epoch 0
/home/pchelintsev/.conda/envs/mdetr_env3/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Epoch: [0]  [    0/73902]  eta: 25 days, 11:34:41  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 14.3907 (14.3907)  loss_ce: 2.4076 (2.4076)  loss_bbox: 0.0000 (0.0000)  loss_giou: 0.0000 (0.0000)  loss_contrastive_align: 0.0000 (0.0000)  loss_ce_0: 2.4669 (2.4669)  loss_bbox_0: 0.0000 (0.0000)  loss_giou_0: 0.0000 (0.0000)  loss_contrastive_align_0: 0.0000 (0.0000)  loss_ce_1: 2.2301 (2.2301)  loss_bbox_1: 0.0000 (0.0000)  loss_giou_1: 0.0000 (0.0000)  loss_contrastive_align_1: 0.0000 (0.0000)  loss_ce_2: 2.5516 (2.5516)  loss_bbox_2: 0.0000 (0.0000)  loss_giou_2: 0.0000 (0.0000)  loss_contrastive_align_2: 0.0000 (0.0000)  loss_ce_3: 2.3101 (2.3101)  loss_bbox_3: 0.0000 (0.0000)  loss_giou_3: 0.0000 (0.0000)  loss_contrastive_align_3: 0.0000 (0.0000)  loss_ce_4: 2.4244 (2.4244)  loss_bbox_4: 0.0000 (0.0000)  loss_giou_4: 0.0000 (0.0000)  loss_contrastive_align_4: 0.0000 (0.0000)  loss_ce_unscaled: 2.4076 (2.4076)  loss_bbox_unscaled: 0.0000 (0.0000)  loss_giou_unscaled: 0.0000 (0.0000)  cardinality_error_unscaled: 3.0000 (3.0000)  loss_contrastive_align_unscaled: 0.0000 (0.0000)  loss_ce_0_unscaled: 2.4669 (2.4669)  loss_bbox_0_unscaled: 0.0000 (0.0000)  loss_giou_0_unscaled: 0.0000 (0.0000)  cardinality_error_0_unscaled: 3.0000 (3.0000)  loss_contrastive_align_0_unscaled: 0.0000 (0.0000)  loss_ce_1_unscaled: 2.2301 (2.2301)  loss_bbox_1_unscaled: 0.0000 (0.0000)  loss_giou_1_unscaled: 0.0000 (0.0000)  cardinality_error_1_unscaled: 2.0000 (2.0000)  loss_contrastive_align_1_unscaled: 0.0000 (0.0000)  loss_ce_2_unscaled: 2.5516 (2.5516)  loss_bbox_2_unscaled: 0.0000 (0.0000)  loss_giou_2_unscaled: 0.0000 (0.0000)  cardinality_error_2_unscaled: 3.0000 (3.0000)  loss_contrastive_align_2_unscaled: 0.0000 (0.0000)  loss_ce_3_unscaled: 2.3101 (2.3101)  loss_bbox_3_unscaled: 0.0000 (0.0000)  loss_giou_3_unscaled: 0.0000 (0.0000)  cardinality_error_3_unscaled: 2.5000 (2.5000)  loss_contrastive_align_3_unscaled: 0.0000 (0.0000)  loss_ce_4_unscaled: 
2.4244 (2.4244)  loss_bbox_4_unscaled: 0.0000 (0.0000)  loss_giou_4_unscaled: 0.0000 (0.0000)  cardinality_error_4_unscaled: 3.0000 (3.0000)  loss_contrastive_align_4_unscaled: 0.0000 (0.0000)  time: 29.7919  data: 1.1390  max mem: 0
Epoch: [0]  [   10/73902]  eta: 18 days, 6:11:53  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 40.6416 (50.0278)  loss_ce: 3.9509 (4.6128)  loss_bbox: 0.2337 (0.4157)  loss_giou: 0.5140 (0.6735)  loss_contrastive_align: 1.1038 (2.0033)  loss_ce_0: 6.8205 (5.5707)  loss_bbox_0: 0.1225 (0.3284)  loss_giou_0: 0.3370 (0.6317)  loss_contrastive_align_0: 1.8220 (2.5607)  loss_ce_1: 5.9798 (5.2364)  loss_bbox_1: 0.2626 (0.3924)  loss_giou_1: 0.4373 (0.7185)  loss_contrastive_align_1: 1.6724 (2.5045)  loss_ce_2: 4.3847 (5.0343)  loss_bbox_2: 0.2473 (0.3728)  loss_giou_2: 0.5318 (0.6514)  loss_contrastive_align_2: 1.0731 (2.3479)  loss_ce_3: 4.0940 (4.8984)  loss_bbox_3: 0.2544 (0.4026)  loss_giou_3: 0.5044 (0.6831)  loss_contrastive_align_3: 1.0696 (2.1846)  loss_ce_4: 3.9194 (4.6899)  loss_bbox_4: 0.2297 (0.4037)  loss_giou_4: 0.4369 (0.6624)  loss_contrastive_align_4: 1.0977 (2.0480)  loss_ce_unscaled: 3.9509 (4.6128)  loss_bbox_unscaled: 0.0467 (0.0831)  loss_giou_unscaled: 0.2570 (0.3368)  cardinality_error_unscaled: 1.0000 (1.0909)  loss_contrastive_align_unscaled: 1.1038 (2.0033)  loss_ce_0_unscaled: 6.8205 (5.5707)  loss_bbox_0_unscaled: 0.0245 (0.0657)  loss_giou_0_unscaled: 0.1685 (0.3159)  cardinality_error_0_unscaled: 1.0000 (1.5000)  loss_contrastive_align_0_unscaled: 1.8220 (2.5607)  loss_ce_1_unscaled: 5.9798 (5.2364)  loss_bbox_1_unscaled: 0.0525 (0.0785)  loss_giou_1_unscaled: 0.2186 (0.3593)  cardinality_error_1_unscaled: 1.0000 (1.1818)  loss_contrastive_align_1_unscaled: 1.6724 (2.5045)  loss_ce_2_unscaled: 4.3847 (5.0343)  loss_bbox_2_unscaled: 0.0495 (0.0746)  loss_giou_2_unscaled: 0.2659 (0.3257)  cardinality_error_2_unscaled: 1.0000 (1.2273)  loss_contrastive_align_2_unscaled: 1.0731 (2.3479)  loss_ce_3_unscaled: 4.0940 (4.8984)  loss_bbox_3_unscaled: 0.0509 (0.0805)  loss_giou_3_unscaled: 0.2522 (0.3415)  cardinality_error_3_unscaled: 1.0000 (1.1818)  loss_contrastive_align_3_unscaled: 1.0696 (2.1846)  loss_ce_4_unscaled: 
3.9194 (4.6899)  loss_bbox_4_unscaled: 0.0459 (0.0807)  loss_giou_4_unscaled: 0.2184 (0.3312)  cardinality_error_4_unscaled: 1.0000 (1.0909)  loss_contrastive_align_4_unscaled: 1.0977 (2.0480)  time: 21.3489  data: 0.1094  max mem: 0
Epoch: [0]  [   20/73902]  eta: 18 days, 16:13:48  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 34.8511 (46.1291)  loss_ce: 2.4758 (5.0781)  loss_bbox: 0.2299 (0.3856)  loss_giou: 0.4293 (0.7006)  loss_contrastive_align: 0.4411 (1.4506)  loss_ce_0: 4.1652 (5.0743)  loss_bbox_0: 0.0811 (0.3611)  loss_giou_0: 0.1932 (0.6737)  loss_contrastive_align_0: 0.3715 (1.6258)  loss_ce_1: 2.6043 (5.1180)  loss_bbox_1: 0.1499 (0.3908)  loss_giou_1: 0.3773 (0.7349)  loss_contrastive_align_1: 0.3961 (1.6426)  loss_ce_2: 2.6675 (5.0676)  loss_bbox_2: 0.1963 (0.3785)  loss_giou_2: 0.3574 (0.6974)  loss_contrastive_align_2: 0.3626 (1.5843)  loss_ce_3: 2.6249 (5.0039)  loss_bbox_3: 0.1436 (0.3871)  loss_giou_3: 0.4561 (0.6985)  loss_contrastive_align_3: 0.3725 (1.5028)  loss_ce_4: 2.5412 (5.0421)  loss_bbox_4: 0.1969 (0.3801)  loss_giou_4: 0.4178 (0.6954)  loss_contrastive_align_4: 0.4074 (1.4550)  loss_ce_unscaled: 2.4758 (5.0781)  loss_bbox_unscaled: 0.0460 (0.0771)  loss_giou_unscaled: 0.2146 (0.3503)  cardinality_error_unscaled: 1.0000 (1.1429)  loss_contrastive_align_unscaled: 0.4411 (1.4506)  loss_ce_0_unscaled: 4.1652 (5.0743)  loss_bbox_0_unscaled: 0.0162 (0.0722)  loss_giou_0_unscaled: 0.0966 (0.3368)  cardinality_error_0_unscaled: 1.0000 (1.4286)  loss_contrastive_align_0_unscaled: 0.3715 (1.6258)  loss_ce_1_unscaled: 2.6043 (5.1180)  loss_bbox_1_unscaled: 0.0300 (0.0782)  loss_giou_1_unscaled: 0.1886 (0.3675)  cardinality_error_1_unscaled: 1.0000 (1.2143)  loss_contrastive_align_1_unscaled: 0.3961 (1.6426)  loss_ce_2_unscaled: 2.6675 (5.0676)  loss_bbox_2_unscaled: 0.0393 (0.0757)  loss_giou_2_unscaled: 0.1787 (0.3487)  cardinality_error_2_unscaled: 1.0000 (1.2857)  loss_contrastive_align_2_unscaled: 0.3626 (1.5843)  loss_ce_3_unscaled: 2.6249 (5.0039)  loss_bbox_3_unscaled: 0.0287 (0.0774)  loss_giou_3_unscaled: 0.2281 (0.3492)  cardinality_error_3_unscaled: 1.0000 (1.2143)  loss_contrastive_align_3_unscaled: 0.3725 (1.5028)  loss_ce_4_unscaled: 
2.5412 (5.0421)  loss_bbox_4_unscaled: 0.0394 (0.0760)  loss_giou_4_unscaled: 0.2089 (0.3477)  cardinality_error_4_unscaled: 1.0000 (1.1905)  loss_contrastive_align_4_unscaled: 0.4074 (1.4550)  time: 21.4431  data: 0.0061  max mem: 0
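
Log lines like the ones above are hard to compare by eye. A minimal sketch, assuming each metric is printed as `name: x (y)` as it is in this thread; the helper name `parse_metrics` is made up for illustration and is not part of MDETR. Lines that were wrapped when pasted (for example after `loss_ce_4_unscaled:`) would need to be joined back together first.

```python
import re

# Matches "name: x (y)" pairs as they appear in the pasted log lines.
# Fields without a parenthesised second number (lr, eta, ...) are skipped.
PAIR = re.compile(r"(\w+): (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_metrics(line):
    """Return {name: (first_number, number_in_parentheses)} for one log line."""
    return {name: (float(a), float(b)) for name, a, b in PAIR.findall(line)}


# Shortened copy of the iteration-10 line from the CPU run.
sample = (
    "Epoch: [0]  [   10/73902]  eta: 18 days, 6:11:53  lr: 0.000100  "
    "loss: 40.6416 (50.0278)  loss_ce: 3.9509 (4.6128)  loss_bbox: 0.2337 (0.4157)"
)
metrics = parse_metrics(sample)
for name, (first, second) in metrics.items():
    print(f"{name:>12}  {first:10.4f}  {second:10.4f}")
```

Putting the GPU and CPU lines through the same function would make it easier to diff the two runs field by field.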

What could be wrong? :( Also, I've made sure that the transformers version is 4.5.1.

TopCoder2K avatar Dec 27 '21 19:12 TopCoder2K

I did not encounter that error when running on the GPU. I run it with Slurm on eight 2080 Ti cards, so I am not sure what is happening here. Sorry that I cannot help with that.

Flaick avatar Jan 06 '22 05:01 Flaick

Hmm, that's strange... We should have the same libraries. Which CUDA version do you have?

I hope @alcinos can help! Here is the necessary info:

  • the exact command is
python main.py --dataset_config configs/lvis.json --load pretrained_resnet101_checkpoint.pth --ema --epochs 150 --lr_drop 120 --eval_skip 5
  • my environment is
Tesla V100 PCIe 32GB
Python 3.8.12
CUDA Version 11.0.228
  • packages in the conda environment:
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
abseil-cpp                20210324.2           h2531618_0  
aiohttp                   3.7.4.post0      py38h7f8727e_2  
appdirs                   1.4.4                    pypi_0    pypi
arrow-cpp                 3.0.0            py38h6b21186_4  
async-timeout             3.0.1            py38h06a4308_0  
attrs                     21.2.0             pyhd3eb1b0_0  
aws-c-common              0.4.57               he6710b0_1  
aws-c-event-stream        0.1.6                h2531618_5  
aws-checksums             0.1.9                he6710b0_0  
aws-sdk-cpp               1.8.185              hce553d0_0  
bcj-cffi                  0.5.1            py38h295c915_0  
blas                      1.0                         mkl  
boost-cpp                 1.73.0              h27cfd23_11  
bottleneck                1.3.2            py38heb32a55_1  
brotli                    1.0.9                he6710b0_2  
brotli-python             1.0.9            py38heb0550a_2  
brotlicffi                1.0.9.2          py38h295c915_0  
brotlipy                  0.7.0           py38h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.17.1               h27cfd23_0  
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.10.8        py38h06a4308_0  
cffi                      1.14.6           py38h400218f_0  
chardet                   4.0.0           py38h06a4308_1003  
charset-normalizer        2.0.7                    pypi_0    pypi
click                     8.0.3                    pypi_0    pypi
cloudpickle               2.0.0                    pypi_0    pypi
colorama                  0.4.4              pyhd3eb1b0_0  
conllu                    4.4.1              pyhd3eb1b0_0  
cryptography              35.0.0           py38hd23ed53_0  
cycler                    0.10.0                   pypi_0    pypi
cython                    0.29.24                  pypi_0    pypi
dataclasses               0.8                pyh6d0b6a4_7  
datasets                  1.12.1             pyhd3eb1b0_0  
dill                      0.3.4              pyhd3eb1b0_0  
double-conversion         3.1.5                he6710b0_1  
et_xmlfile                1.1.0            py38h06a4308_0  
filelock                  3.3.1              pyhd3eb1b0_1  
fsspec                    2021.8.1           pyhd3eb1b0_0  
gflags                    2.2.2                he6710b0_0  
glog                      0.5.0                h2531618_0  
gmp                       6.2.1                h2531618_2  
grpc-cpp                  1.39.0               hae934f6_5  
h5py                      3.5.0                    pypi_0    pypi
huggingface-hub           0.0.19                   pypi_0    pypi
huggingface_hub           0.0.17             pyhd3eb1b0_0  
icu                       58.2                 he6710b0_3  
idna                      3.3                      pypi_0    pypi
importlib-metadata        4.8.1            py38h06a4308_0  
importlib_metadata        4.8.1                hd3eb1b0_0  
intel-openmp              2021.4.0          h06a4308_3561  
joblib                    1.1.0                    pypi_0    pypi
kiwisolver                1.3.2                    pypi_0    pypi
krb5                      1.19.2               hac12032_0  
ld_impl_linux-64          2.35.1               h7274673_9  
libboost                  1.73.0              h3ff78a5_11  
libcurl                   7.78.0               h0b77cf5_0  
libedit                   3.1.20210714         h7f8727e_0  
libev                     4.33                 h7f8727e_1  
libevent                  2.1.8                h1ba5d50_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgomp                   9.3.0               h5101ec6_17  
libnghttp2                1.41.0               hf8bcb03_2  
libprotobuf               3.17.2               h4ff587b_1  
libssh2                   1.9.0                h1ba5d50_1  
libstdcxx-ng              9.3.0               hd4cf53a_17  
libthrift                 0.14.2               hcc01f38_0  
libxml2                   2.9.12               h03d6c58_0  
libxslt                   1.1.34               hc22bd24_0  
lxml                      4.6.3            py38h9120a33_0  
lz4-c                     1.9.3                h295c915_1  
matplotlib                3.4.3                    pypi_0    pypi
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py38h7f8727e_0  
mkl_fft                   1.3.1            py38hd3c417c_0  
mkl_random                1.2.2            py38h51133e4_0  
multidict                 5.1.0            py38h27cfd23_2  
multimodal                0.0.12                   pypi_0    pypi
multiprocess              0.70.12.2        py38h7f8727e_0  
multivolumefile           0.2.3              pyhd3eb1b0_0  
ncurses                   6.2                  he6710b0_1  
numexpr                   2.7.3            py38h22e1b3c_1  
numpy                     1.21.3                   pypi_0    pypi
numpy-base                1.21.2           py38h79a1101_0  
openpyxl                  3.0.9              pyhd3eb1b0_0  
openssl                   1.1.1l               h7f8727e_0  
orc                       1.6.9                ha97a36c_3  
packaging                 21.0               pyhd3eb1b0_0  
pandas                    1.3.4            py38h8c16a72_0  
pillow                    8.4.0                    pypi_0    pypi
pip                       21.2.4           py38h06a4308_0  
portalocker               2.3.0            py38h06a4308_0  
py7zr                     0.16.1             pyhd3eb1b0_1  
pyarrow                   3.0.0            py38he0739d4_3  
pycparser                 2.20                       py_2  
pycryptodomex             3.10.1           py38h27cfd23_1  
pyopenssl                 21.0.0             pyhd3eb1b0_1  
pyparsing                 3.0.1                    pypi_0    pypi
pyppmd                    0.16.1           py38h295c915_0  
pysmartdl                 1.3.4                    pypi_0    pypi
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.12               h12debd9_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
python-xxhash             2.0.2            py38h7f8727e_0  
pytz                      2021.3             pyhd3eb1b0_0  
pyyaml                    6.0                      pypi_0    pypi
pyzstd                    0.14.4           py38h7f8727e_3  
re2                       2020.11.01           h2531618_1  
readline                  8.1                  h27cfd23_0  
regex                     2021.10.23               pypi_0    pypi
requests                  2.26.0             pyhd3eb1b0_0  
sacrebleu                 2.0.0              pyhd3eb1b0_1  
scipy                     1.7.1                    pypi_0    pypi
sentencepiece             0.1.95           py38hd09550d_0  
setuptools                58.0.4           py38h06a4308_0  
six                       1.16.0             pyhd3eb1b0_0  
snappy                    1.1.8                he6710b0_0  
sqlite                    3.36.0               hc218d9a_0  
tables                    3.6.1                    pypi_0    pypi
tabulate                  0.8.9            py38h06a4308_0  
texttable                 1.6.4              pyhd3eb1b0_0  
tk                        8.6.11               h1ccaba5_0  
torch                     1.9.1                    pypi_0    pypi
torchtext                 0.10.1                   pypi_0    pypi
torchvision               0.10.1                   pypi_0    pypi
tqdm                      4.62.3                   pypi_0    pypi
transformers              4.5.1                    pypi_0    pypi
typing                    3.10.0.0         py38h06a4308_0  
typing-extensions         3.10.0.2             hd3eb1b0_0  
typing_extensions         3.10.0.2           pyh06a4308_0  
uriparser                 0.9.3                he6710b0_1  
urllib3                   1.26.7                   pypi_0    pypi
utf8proc                  2.6.1                h27cfd23_0  
wcwidth                   0.2.5                    pypi_0    pypi
wheel                     0.37.0             pyhd3eb1b0_1  
xmltodict                 0.12.0                   pypi_0    pypi
xxhash                    0.8.0                h7f8727e_3  
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h7b6447c_0  
yarl                      1.6.3            py38h27cfd23_0  
zipp                      3.6.0              pyhd3eb1b0_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.9                haebb681_0
  • no, I didn't change the code and used the GitHub version
  • I tried using the CPU option: it ran for more than 1.5 days and got through almost half of the first epoch (then I just shut it down)
  • I checked with gpustat that GPU memory was not running out (with batch_size=1 MDETR uses no more than 7 GB)

TopCoder2K avatar Jan 15 '22 18:01 TopCoder2K

Also, I tried to run fine-tuning in Docker with CUDA 10.2 and CUDA 11.1. Again, it works on the CPU, but I still get the same error on the GPU :( Here is what I ran to set up the environments:

conda init
bash
conda create -n mdetr_env python=3.8
conda activate mdetr_env
pip install numpy
pip install -r requirements.txt

numpy is needed because pycocotools uses it (I got an error without numpy installed). Also, it may be worth pointing out that pycocotools "was installed using the legacy 'setup.py install' method, because a wheel could not be built for it". conda list gives:

_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.10.8        py38h06a4308_2  
charset-normalizer        2.0.10                   pypi_0    pypi
click                     8.0.3                    pypi_0    pypi
cloudpickle               2.0.0                    pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
cython                    0.29.26                  pypi_0    pypi
filelock                  3.4.2                    pypi_0    pypi
flatbuffers               2.0                      pypi_0    pypi
fonttools                 4.29.0                   pypi_0    pypi
idna                      3.3                      pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
kiwisolver                1.3.2                    pypi_0    pypi
ld_impl_linux-64          2.35.1               h7274673_9  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgomp                   9.3.0               h5101ec6_17  
libstdcxx-ng              9.3.0               hd4cf53a_17  
matplotlib                3.5.1                    pypi_0    pypi
ncurses                   6.3                  h7f8727e_2  
numpy                     1.22.1                   pypi_0    pypi
onnx                      1.10.2                   pypi_0    pypi
onnxruntime               1.10.0                   pypi_0    pypi
openssl                   1.1.1m               h7f8727e_0  
packaging                 21.3                     pypi_0    pypi
panopticapi               0.1                      pypi_0    pypi
pillow                    9.0.0                    pypi_0    pypi
pip                       21.2.4           py38h06a4308_0  
prettytable               3.0.0                    pypi_0    pypi
protobuf                  3.19.3                   pypi_0    pypi
pycocotools               2.0                      pypi_0    pypi
pyparsing                 3.0.7                    pypi_0    pypi
python                    3.8.12               h12debd9_0  
python-dateutil           2.8.2                    pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
regex                     2022.1.18                pypi_0    pypi
requests                  2.27.1                   pypi_0    pypi
sacremoses                0.0.47                   pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
setuptools                58.0.4           py38h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.37.0               hc218d9a_0  
submitit                  1.4.1                    pypi_0    pypi
timm                      0.5.4                    pypi_0    pypi
tk                        8.6.11               h1ccaba5_0  
tokenizers                0.10.3                   pypi_0    pypi
torch                     1.9.1                    pypi_0    pypi
torchvision               0.10.1                   pypi_0    pypi
tqdm                      4.62.3                   pypi_0    pypi
transformers              4.5.1                    pypi_0    pypi
typing-extensions         4.0.1                    pypi_0    pypi
urllib3                   1.26.8                   pypi_0    pypi
wcwidth                   0.2.5                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
xmltodict                 0.12.0                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7f8727e_4

transformers=4.5.1, so I have no idea why the error occurs. Maybe I should try the good old 'print' method and print all the sizes in the hope of noticing something wrong.
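
The 'print' approach could look roughly like this. It is only a sketch: `describe` is a made-up helper, not part of MDETR or PyTorch, and it reads only the `shape`, `dtype` and `device` attributes, so it works on torch tensors and on the stand-in class used here to keep the example runnable without torch.

```python
def describe(name, value):
    """Summarise a tensor-like object, or a tuple/list of them, on one line."""
    if isinstance(value, (tuple, list)):
        # Index tuples such as (batch_idx, query_idx) are described element by element.
        return "; ".join(describe(f"{name}[{i}]", v) for i, v in enumerate(value))
    shape = tuple(getattr(value, "shape", ()))
    dtype = getattr(value, "dtype", type(value).__name__)
    device = getattr(value, "device", "n/a")
    return f"{name}: shape={shape} dtype={dtype} device={device}"


class FakeTensor:
    """Stand-in for a tensor so the sketch runs without torch installed."""

    def __init__(self, shape, dtype="int64", device="cuda:0"):
        self.shape, self.dtype, self.device = shape, dtype, device


# In the real code, the two calls would go right before the failing line
# `eos_coef[src_idx] = 1` in models/mdetr.py, using the actual variables.
print(describe("eos_coef", FakeTensor((1, 100), dtype="float32")))
print(describe("src_idx", (FakeTensor((3,)), FakeTensor((3,)))))
```

If the shapes printed on the GPU and on the CPU are the same, the data is probably fine and the difference is more likely in the library versions.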

TopCoder2K avatar Jan 26 '22 10:01 TopCoder2K

Oh, there is no error with python=3.7.10, torch=1.8.1, torchvision=0.9.1, CUDA=11.1, transformers=4.5.1! With the recommended python=3.8 it also works (I'm using python=3.8.12)
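
For anyone hitting the same assert, here is a small sketch that compares the installed packages against the combination reported as working above. The `check_versions` helper and the `KNOWN_GOOD` table are illustrative, not part of MDETR, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

# Versions reported as working in this thread.
KNOWN_GOOD = {"torch": "1.8.1", "torchvision": "0.9.1", "transformers": "4.5.1"}


def check_versions(expected=KNOWN_GOOD, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except metadata.PackageNotFoundError:
            have = None
        # Ignore local build suffixes such as "+cu111" when comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in check_versions().items():
        print(f"{name}: expected {want}, found {have}")
```

An empty result only means the three packages match; it says nothing about the CUDA toolkit or the driver.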

TopCoder2K avatar Jan 26 '22 14:01 TopCoder2K