
ValueError: matrix contains invalid numeric entries for binary classification

[Open] kirillkoncha opened this issue 2 years ago • 3 comments

Hi! I would like to train sparse_inst_r50_giam_fp16 for binary classification. I registered my train and test datasets and started training with the command python3.9 tools/train_net.py --config-file configs/sparse_inst_r50_giam_fp16.yaml --num-gpus 1 SOLVER.AMP.ENABLED True. As far as I understand, there is no need to resize the images in my datasets (they are all 2048x2448).

However, I got ValueError: matrix contains invalid numeric entries on the 97th iteration.
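For context, this exact error message is raised by scipy.optimize.linear_sum_assignment when the cost matrix passed to it contains NaN or Inf entries; with AMP/fp16 training, an overflow in the matcher's cost computation is a common cause. A minimal sketch of a workaround, assuming the matcher hands its cost matrix to scipy's Hungarian solver, is to replace invalid entries with a large finite cost before matching (the 1e8 sentinel here is an arbitrary choice, not a value from SparseInst):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching cost matrix; under fp16 an overflow can leave
# NaN/Inf entries, which makes linear_sum_assignment raise
# "ValueError: matrix contains invalid numeric entries".
cost = np.array([[0.2, np.nan],
                 [np.inf, 0.7]])

# Sanitize: map NaN/Inf to large finite costs so the solver never picks them.
cost = np.nan_to_num(cost, nan=1e8, posinf=1e8, neginf=-1e8)

# Hungarian matching now succeeds on the cleaned matrix.
row_ind, col_ind = linear_sum_assignment(cost)
```

In practice the fix would go inside the matcher, right before the linear_sum_assignment call, so invalid fp16 values never reach scipy.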

Here is my environment and full logs:

[09/29 01:09:23] detectron2 INFO: Rank of current process: 0. World size: 1
[09/29 01:09:31] detectron2 INFO: Environment info:
----------------------  ------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.9.10 (main, Jan 15 2022, 18:56:52) [GCC 7.5.0]
numpy                   1.23.3
detectron2              0.6 @/raid/kirill/test/venv/lib/python3.9/site-packages/detectron2
Compiler                GCC 7.3
CUDA compiler           CUDA 11.1
detectron2 arch flags   3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.0+cu111 @/raid/kirill/test/venv/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Tesla V100-SXM3-32GB (arch=7.0)
Driver version          450.142.00
CUDA_HOME               /usr/local/cuda
Pillow                  9.2.0
torchvision             0.11.0+cu111 @/raid/kirill/test/venv/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.6.0
----------------------  ------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

[09/29 01:09:31] detectron2 INFO: Command line arguments: Namespace(config_file='configs/sparse_inst_r50_giam_fp16.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:50153', opts=['SOLVER.AMP.ENABLED', 'True'])
[09/29 01:09:31] detectron2 INFO: Contents of args.config_file=configs/sparse_inst_r50_giam_fp16.yaml:
_BASE_: "Base-SparseInst.yaml"
MODEL:
  WEIGHTS: "pretrained_models/R-50.pkl"
SOLVER:
  AMP:
    ENABLED: True
OUTPUT_DIR: "output/sparse_inst_r50_giam_fp16"
[09/29 01:09:31] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 6
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  - maf_val
  TRAIN:
  - maf_train
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: false
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  FORMAT: RGB
  MASK_FORMAT: bitmask
  MAX_SIZE_TEST: 853
  MAX_SIZE_TRAIN: 853
  MIN_SIZE_TEST: 640
  MIN_SIZE_TRAIN:
  - 416
  - 448
  - 480
  - 512
  - 544
  - 576
  - 608
  - 640
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
      - 64
      - 128
      - 256
      - 512
  BACKBONE:
    FREEZE_AT: 0
    NAME: build_resnet_backbone
  CSPNET:
    NAME: darknet53
    NORM: ''
    OUT_FEATURES:
    - csp1
    - csp2
    - csp3
    - csp4
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: false
  LOAD_PROPOSALS: false
  MASK_ON: true
  META_ARCHITECTURE: SparseInst
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 123.675
  - 116.28
  - 103.53
  PIXEL_STD:
  - 58.395
  - 57.12
  - 57.375
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  PVT:
    LINEAR: false
    NAME: b1
    OUT_FEATURES:
    - p2
    - p3
    - p4
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: false
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: &id002
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NORM: ''
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - &id001
      - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id001
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: ''
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id002
    BOUNDARY_THRESH: -1
    CONV_DIMS:
    - -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES:
    - p2
    - p3
    - p4
    - p5
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  SPARSE_INST:
    CLS_THRESHOLD: 0.005
    DATASET_MAPPER: SparseInstDatasetMapper
    DECODER:
      GROUPS: 4
      INST:
        CONVS: 4
        DIM: 256
      KERNEL_DIM: 128
      MASK:
        CONVS: 4
        DIM: 256
      NAME: GroupIAMDecoder
      NUM_CLASSES: 2
      NUM_MASKS: 100
      OUTPUT_IAM: false
      SCALE_FACTOR: 2.0
    ENCODER:
      IN_FEATURES:
      - res3
      - res4
      - res5
      NAME: InstanceContextEncoder
      NORM: ''
      NUM_CHANNELS: 256
    LOSS:
      CLASS_WEIGHT: 2.0
      ITEMS:
      - labels
      - masks
      MASK_DICE_WEIGHT: 2.0
      MASK_PIXEL_WEIGHT: 5.0
      NAME: SparseInstCriterion
      OBJECTNESS_WEIGHT: 1.0
    MASK_THRESHOLD: 0.45
    MATCHER:
      ALPHA: 0.8
      BETA: 0.2
      NAME: SparseInstMatcher
    MAX_DETECTIONS: 100
  WEIGHTS: pretrained_models/R-50.pkl
OUTPUT_DIR: output/sparse_inst_r50_giam_fp16
SEED: -1
SOLVER:
  AMP:
    ENABLED: true
  AMSGRAD: false
  BACKBONE_MULTIPLIER: 1.0
  BASE_LR: 5.0e-05
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: false
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 8
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 170000
  MOMENTUM: 0.9
  NESTEROV: false
  OPTIMIZER: ADAMW
  REFERENCE_WORLD_SIZE: 0
  STEPS:
  - 210000
  - 250000
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.05
  WEIGHT_DECAY_BIAS: null
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 7330
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: false
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0

[09/29 01:09:31] detectron2 INFO: Full config saved to output/sparse_inst_r50_giam_fp16/config.yaml
[09/29 01:09:31] d2.utils.env INFO: Using a generated random seed 31206096
[09/29 01:09:39] d2.engine.defaults INFO: Model:
SparseInst(
  (backbone): ResNet(
    (stem): BasicStem(
      (conv1): Conv2d(
        3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
        (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
      )
    )
    (res2): Sequential(
      (0): BottleneckBlock(
        (shortcut): Conv2d(
          64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv1): Conv2d(
          64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv3): Conv2d(
          64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
      )
      (1): BottleneckBlock(
        (conv1): Conv2d(
          256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv3): Conv2d(
          64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
      )
      (2): BottleneckBlock(
        (conv1): Conv2d(
          256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
        )
        (conv3): Conv2d(
          64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
      )
    )
    (res3): Sequential(
      (0): BottleneckBlock(
        (shortcut): Conv2d(
          256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv1): Conv2d(
          256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv3): Conv2d(
          128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
      )
      (1): BottleneckBlock(
        (conv1): Conv2d(
          512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv3): Conv2d(
          128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
      )
      (2): BottleneckBlock(
        (conv1): Conv2d(
          512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv3): Conv2d(
          128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
      )
      (3): BottleneckBlock(
        (conv1): Conv2d(
          512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
        )
        (conv3): Conv2d(
          128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
      )
    )
    (res4): Sequential(
      (0): BottleneckBlock(
        (shortcut): Conv2d(
          512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
        (conv1): Conv2d(
          512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
      (1): BottleneckBlock(
        (conv1): Conv2d(
          1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
      (2): BottleneckBlock(
        (conv1): Conv2d(
          1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
      (3): BottleneckBlock(
        (conv1): Conv2d(
          1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
      (4): BottleneckBlock(
        (conv1): Conv2d(
          1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
      (5): BottleneckBlock(
        (conv1): Conv2d(
          1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
        )
        (conv3): Conv2d(
          256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
        )
      )
    )
    (res5): Sequential(
      (0): BottleneckBlock(
        (shortcut): Conv2d(
          1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
        )
        (conv1): Conv2d(
          1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv3): Conv2d(
          512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
        )
      )
      (1): BottleneckBlock(
        (conv1): Conv2d(
          2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv3): Conv2d(
          512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
        )
      )
      (2): BottleneckBlock(
        (conv1): Conv2d(
          2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
        )
        (conv3): Conv2d(
          512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
        )
      )
    )
  )
  (encoder): InstanceContextEncoder(
    (fpn_laterals): ModuleList(
      (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
      (2): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
    )
    (fpn_outputs): ModuleList(
      (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (ppm): PyramidPoolingModule(
      (stages): ModuleList(
        (0): Sequential(
          (0): AdaptiveAvgPool2d(output_size=(1, 1))
          (1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
        )
        (1): Sequential(
          (0): AdaptiveAvgPool2d(output_size=(2, 2))
          (1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
        )
        (2): Sequential(
          (0): AdaptiveAvgPool2d(output_size=(3, 3))
          (1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
        )
        (3): Sequential(
          (0): AdaptiveAvgPool2d(output_size=(6, 6))
          (1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
        )
      )
      (bottleneck): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
    )
    (fusion): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1))
  )
  (decoder): GroupIAMDecoder(
    (inst_branch): GroupInstanceBranch(
      (inst_convs): Sequential(
        (0): Conv2d(258, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU(inplace=True)
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU(inplace=True)
        (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): ReLU(inplace=True)
        (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (7): ReLU(inplace=True)
      )
      (iam_conv): Conv2d(256, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4)
      (fc): Linear(in_features=1024, out_features=1024, bias=True)
      (cls_score): Linear(in_features=1024, out_features=2, bias=True)
      (mask_kernel): Linear(in_features=1024, out_features=128, bias=True)
      (objectness): Linear(in_features=1024, out_features=1, bias=True)
    )
    (mask_branch): MaskBranch(
      (mask_convs): Sequential(
        (0): Conv2d(258, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU(inplace=True)
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU(inplace=True)
        (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (5): ReLU(inplace=True)
        (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (7): ReLU(inplace=True)
      )
      (projection): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (criterion): SparseInstCriterion(
    (matcher): SparseInstMatcher()
  )
)
[09/29 01:09:39] sparseinst.dataset_mapper INFO: [DatasetMapper] Augmentations used in training: [RandomFlip(), ResizeShortestEdge(short_edge_length=(416, 448, 480, 512, 544, 576, 608, 640), max_size=853, sample_style='choice')]
[09/29 01:09:39] d2.data.datasets.coco INFO: Loaded 1997 images in COCO format from /raid/kirill/test/data/maf_final/train.json
[09/29 01:09:39] d2.data.build INFO: Removed 3 images with no usable annotations. 1994 images left.
[09/29 01:09:39] d2.data.build INFO: Distribution of instances among all 2 categories:
|  category   | #instances   |  category  | #instances   |
|:-----------:|:-------------|:----------:|:-------------|
| colonna_box | 8170         | sphere_box | 7577         |
|             |              |            |              |
|    total    | 15747        |            |              |
[09/29 01:09:39] d2.data.build INFO: Using training sampler TrainingSampler
[09/29 01:09:39] d2.data.common INFO: Serializing 1994 elements to byte tensors and concatenating them all ...
[09/29 01:09:39] d2.data.common INFO: Serialized dataset takes 7.23 MiB
[09/29 01:09:39] d2.solver.build WARNING: SOLVER.STEPS contains values larger than SOLVER.MAX_ITER. These values will be ignored.
[09/29 01:09:39] fvcore.common.checkpoint INFO: [Checkpointer] Loading from pretrained_models/R-50.pkl ...
[09/29 01:09:39] fvcore.common.checkpoint INFO: Reading a file from 'torchvision'
[09/29 01:09:39] d2.checkpoint.c2_model_loading INFO: Following weights matched with submodule backbone:
| Names in Model    | Names in Checkpoint                                                               | Shapes                                          |
|:------------------|:----------------------------------------------------------------------------------|:------------------------------------------------|
| res2.0.conv1.*    | res2.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,1,1)             |
| res2.0.conv2.*    | res2.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.0.conv3.*    | res2.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.0.shortcut.* | res2.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.1.conv1.*    | res2.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| res2.1.conv2.*    | res2.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.1.conv3.*    | res2.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res2.2.conv1.*    | res2.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,256,1,1)            |
| res2.2.conv2.*    | res2.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)             |
| res2.2.conv3.*    | res2.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,64,1,1)        |
| res3.0.conv1.*    | res3.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,256,1,1)       |
| res3.0.conv2.*    | res3.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.0.conv3.*    | res3.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.0.shortcut.* | res3.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (512,) (512,) (512,) (512,) (512,256,1,1)       |
| res3.1.conv1.*    | res3.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.1.conv2.*    | res3.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.1.conv3.*    | res3.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.2.conv1.*    | res3.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.2.conv2.*    | res3.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.2.conv3.*    | res3.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res3.3.conv1.*    | res3.3.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,512,1,1)       |
| res3.3.conv2.*    | res3.3.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3)       |
| res3.3.conv3.*    | res3.3.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,128,1,1)       |
| res4.0.conv1.*    | res4.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,512,1,1)       |
| res4.0.conv2.*    | res4.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.0.conv3.*    | res4.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.0.shortcut.* | res4.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)  |
| res4.1.conv1.*    | res4.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.1.conv2.*    | res4.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.1.conv3.*    | res4.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.2.conv1.*    | res4.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.2.conv2.*    | res4.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.2.conv3.*    | res4.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.3.conv1.*    | res4.3.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.3.conv2.*    | res4.3.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.3.conv3.*    | res4.3.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.4.conv1.*    | res4.4.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.4.conv2.*    | res4.4.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.4.conv3.*    | res4.4.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res4.5.conv1.*    | res4.5.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,1024,1,1)      |
| res4.5.conv2.*    | res4.5.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3)       |
| res4.5.conv3.*    | res4.5.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)  |
| res5.0.conv1.*    | res5.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,1024,1,1)      |
| res5.0.conv2.*    | res5.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| res5.0.conv3.*    | res5.0.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| res5.0.shortcut.* | res5.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (2048,) (2048,) (2048,) (2048,) (2048,1024,1,1) |
| res5.1.conv1.*    | res5.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,2048,1,1)      |
| res5.1.conv2.*    | res5.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| res5.1.conv3.*    | res5.1.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| res5.2.conv1.*    | res5.2.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,2048,1,1)      |
| res5.2.conv2.*    | res5.2.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3)       |
| res5.2.conv3.*    | res5.2.conv3.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)  |
| stem.conv1.*      | stem.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,3,7,7)              |
[09/29 01:09:39] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
decoder.inst_branch.cls_score.{bias, weight}
decoder.inst_branch.fc.{bias, weight}
decoder.inst_branch.iam_conv.{bias, weight}
decoder.inst_branch.inst_convs.0.{bias, weight}
decoder.inst_branch.inst_convs.2.{bias, weight}
decoder.inst_branch.inst_convs.4.{bias, weight}
decoder.inst_branch.inst_convs.6.{bias, weight}
decoder.inst_branch.mask_kernel.{bias, weight}
decoder.inst_branch.objectness.{bias, weight}
decoder.mask_branch.mask_convs.0.{bias, weight}
decoder.mask_branch.mask_convs.2.{bias, weight}
decoder.mask_branch.mask_convs.4.{bias, weight}
decoder.mask_branch.mask_convs.6.{bias, weight}
decoder.mask_branch.projection.{bias, weight}
encoder.fpn_laterals.0.{bias, weight}
encoder.fpn_laterals.1.{bias, weight}
encoder.fpn_laterals.2.{bias, weight}
encoder.fpn_outputs.0.{bias, weight}
encoder.fpn_outputs.1.{bias, weight}
encoder.fpn_outputs.2.{bias, weight}
encoder.fusion.{bias, weight}
encoder.ppm.bottleneck.{bias, weight}
encoder.ppm.stages.0.1.{bias, weight}
encoder.ppm.stages.1.1.{bias, weight}
encoder.ppm.stages.2.1.{bias, weight}
encoder.ppm.stages.3.1.{bias, weight}
[09/29 01:09:39] fvcore.common.checkpoint WARNING: The checkpoint state_dict contains keys that are not used by the model:
  stem.fc.{bias, weight}
[09/29 01:09:39] d2.engine.train_loop INFO: Starting training from iteration 0
[09/29 01:09:46] d2.utils.events INFO:  eta: 11:40:45  iter: 19  total_loss: 8.062  loss_ce: 2.207  loss_objectness: 0.7034  loss_dice: 1.997  loss_mask: 3.15  time: 0.2915  data_time: 0.1088  lr: 9.9905e-07  max_mem: 3306M
[09/29 01:09:52] d2.utils.events INFO:  eta: 11:37:23  iter: 39  total_loss: 5.251  loss_ce: 2.201  loss_objectness: 0.681  loss_dice: 1.998  loss_mask: 0.3648  time: 0.2725  data_time: 0.0608  lr: 1.998e-06  max_mem: 3306M
[09/29 01:09:57] d2.utils.events INFO:  eta: 11:32:33  iter: 59  total_loss: 4.8  loss_ce: 2.168  loss_objectness: 0.5971  loss_dice: 2  loss_mask: 0.04513  time: 0.2637  data_time: 0.0434  lr: 2.9971e-06  max_mem: 3306M
[09/29 01:10:02] d2.utils.events INFO:  eta: 11:27:17  iter: 79  total_loss: 4.495  loss_ce: 2.049  loss_objectness: 0.3802  loss_dice: 2  loss_mask: 0.06612  time: 0.2663  data_time: 0.0682  lr: 3.9961e-06  max_mem: 3306M
[09/29 01:10:07] d2.utils.events INFO:  eta: 11:22:18  iter: 99  total_loss: 3.744  loss_ce: 1.58  loss_objectness: 0.07274  loss_dice: 2  loss_mask: 0.1108  time: 0.2642  data_time: 0.0532  lr: 4.9951e-06  max_mem: 3306M
[09/29 01:10:12] d2.utils.events INFO:  eta: 11:14:59  iter: 119  total_loss: 3.076  loss_ce: 0.9848  loss_objectness: 0.006809  loss_dice: 2  loss_mask: 0.06297  time: 0.2605  data_time: 0.0441  lr: 5.9941e-06  max_mem: 3306M
[09/29 01:10:17] d2.utils.events INFO:  eta: 11:10:59  iter: 139  total_loss: 2.884  loss_ce: 0.8411  loss_objectness: 0.01047  loss_dice: 1.996  loss_mask: 0.03778  time: 0.2582  data_time: 0.0453  lr: 6.9931e-06  max_mem: 3306M
[09/29 01:10:22] d2.utils.events INFO:  eta: 11:09:34  iter: 159  total_loss: 2.858  loss_ce: 0.8003  loss_objectness: 0.00477  loss_dice: 2  loss_mask: 0.04612  time: 0.2578  data_time: 0.0608  lr: 7.9921e-06  max_mem: 3306M
[09/29 01:10:27] d2.utils.events INFO:  eta: 11:09:09  iter: 179  total_loss: 3  loss_ce: 0.9067  loss_objectness: 0.01053  loss_dice: 2  loss_mask: 0.08507  time: 0.2577  data_time: 0.0532  lr: 8.9911e-06  max_mem: 3306M
[09/29 01:10:33] d2.utils.events INFO:  eta: 11:12:19  iter: 199  total_loss: 2.951  loss_ce: 0.8948  loss_objectness: 0.007725  loss_dice: 2  loss_mask: 0.05398  time: 0.2589  data_time: 0.0681  lr: 9.9901e-06  max_mem: 3306M
[09/29 01:10:38] d2.utils.events INFO:  eta: 11:13:31  iter: 219  total_loss: 2.849  loss_ce: 0.811  loss_objectness: 0.03054  loss_dice: 1.976  loss_mask: 0.02872  time: 0.2595  data_time: 0.0613  lr: 1.0989e-05  max_mem: 3306M
[09/29 01:10:43] d2.utils.events INFO:  eta: 11:13:26  iter: 239  total_loss: 2.832  loss_ce: 0.799  loss_objectness: 0.04742  loss_dice: 1.962  loss_mask: 0.01929  time: 0.2601  data_time: 0.0579  lr: 1.1988e-05  max_mem: 3306M
[09/29 01:10:48] d2.utils.events INFO:  eta: 11:13:21  iter: 259  total_loss: 2.75  loss_ce: 0.7074  loss_objectness: 0.1952  loss_dice: 1.841  loss_mask: 0.02929  time: 0.2594  data_time: 0.0469  lr: 1.2987e-05  max_mem: 3306M
[09/29 01:10:53] d2.utils.events INFO:  eta: 11:12:49  iter: 279  total_loss: 2.748  loss_ce: 0.7283  loss_objectness: 0.197  loss_dice: 1.819  loss_mask: 0.02512  time: 0.2591  data_time: 0.0560  lr: 1.3986e-05  max_mem: 3306M
[09/29 01:10:59] d2.utils.events INFO:  eta: 11:10:31  iter: 299  total_loss: 2.764  loss_ce: 0.7646  loss_objectness: 0.2209  loss_dice: 1.722  loss_mask: 0.01942  time: 0.2595  data_time: 0.0649  lr: 1.4985e-05  max_mem: 3306M
[09/29 01:11:04] d2.utils.events INFO:  eta: 11:13:07  iter: 319  total_loss: 2.743  loss_ce: 0.7367  loss_objectness: 0.2909  loss_dice: 1.681  loss_mask: 0.02197  time: 0.2597  data_time: 0.0602  lr: 1.5984e-05  max_mem: 3306M
[09/29 01:11:09] d2.utils.events INFO:  eta: 11:12:34  iter: 339  total_loss: 2.78  loss_ce: 0.8159  loss_objectness: 0.2865  loss_dice: 1.65  loss_mask: 0.01936  time: 0.2598  data_time: 0.0640  lr: 1.6983e-05  max_mem: 3306M
[09/29 01:11:14] d2.utils.events INFO:  eta: 11:12:56  iter: 359  total_loss: 2.814  loss_ce: 0.8537  loss_objectness: 0.2925  loss_dice: 1.646  loss_mask: 0.02119  time: 0.2595  data_time: 0.0519  lr: 1.7982e-05  max_mem: 3306M
[09/29 01:11:20] d2.utils.events INFO:  eta: 11:13:43  iter: 379  total_loss: 2.674  loss_ce: 0.7588  loss_objectness: 0.3244  loss_dice: 1.58  loss_mask: 0.01589  time: 0.2597  data_time: 0.0611  lr: 1.8981e-05  max_mem: 3306M
[09/29 01:11:25] d2.utils.events INFO:  eta: 11:13:08  iter: 399  total_loss: 2.684  loss_ce: 0.7729  loss_objectness: 0.3391  loss_dice: 1.564  loss_mask: 0.02145  time: 0.2591  data_time: 0.0514  lr: 1.998e-05  max_mem: 3306M
[09/29 01:11:30] d2.utils.events INFO:  eta: 11:13:04  iter: 419  total_loss: 2.67  loss_ce: 0.7592  loss_objectness: 0.328  loss_dice: 1.581  loss_mask: 0.01416  time: 0.2589  data_time: 0.0525  lr: 2.0979e-05  max_mem: 3306M
[09/29 01:11:35] d2.utils.events INFO:  eta: 11:12:37  iter: 439  total_loss: 2.728  loss_ce: 0.8217  loss_objectness: 0.3437  loss_dice: 1.535  loss_mask: 0.01109  time: 0.2586  data_time: 0.0579  lr: 2.1978e-05  max_mem: 3306M
[09/29 01:11:40] d2.utils.events INFO:  eta: 11:12:54  iter: 459  total_loss: 2.696  loss_ce: 0.8166  loss_objectness: 0.3379  loss_dice: 1.539  loss_mask: 0.01454  time: 0.2581  data_time: 0.0453  lr: 2.2977e-05  max_mem: 3306M
[09/29 01:11:40] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
    loss_dict = self.model(data)
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/kirill/test/SparseInst/./sparseinst/sparseinst.py", line 107, in forward
    losses = self.criterion(output, targets, max_shape)
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/kirill/test/SparseInst/./sparseinst/loss.py", line 184, in forward
    indices = self.matcher(outputs_without_aux, targets, input_shape)
  File "/raid/kirill/test/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/kirill/test/SparseInst/./sparseinst/loss.py", line 301, in forward
    indices = [linear_sum_assignment(c[i], maximize=True)
  File "/raid/kirill/test/SparseInst/./sparseinst/loss.py", line 301, in <listcomp>
    indices = [linear_sum_assignment(c[i], maximize=True)
ValueError: matrix contains invalid numeric entries
[09/29 01:11:40] d2.engine.hooks INFO: Overall training speed: 461 iterations in 0:01:59 (0.2582 s / it)
[09/29 01:11:40] d2.engine.hooks INFO: Total training time: 0:01:59 (0:00:00 on hooks)
[09/29 01:11:40] d2.utils.events INFO:  eta: 11:12:44  iter: 463  total_loss: 2.682  loss_ce: 0.8189  loss_objectness: 0.3674  loss_dice: 1.459  loss_mask: 0.0153  time: 0.2578  data_time: 0.0435  lr: 2.3127e-05  max_mem: 3306M

I read this issue, but the error still occurs. Are there any steps to avoid it?
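For reference, the ValueError in the traceback comes from scipy.optimize.linear_sum_assignment, which rejects any cost matrix containing NaN or inf entries. A minimal reproduction (standing in for the matching cost matrix c[i] in sparseinst/loss.py):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A cost matrix with one NaN entry, as produced when the matching
# scores overflow/underflow during fp16 training.
cost = np.array([[0.9, 0.1], [float("nan"), 0.8]])

try:
    linear_sum_assignment(cost, maximize=True)
except ValueError as e:
    print(e)  # matrix contains invalid numeric entries
```

So the assignment call itself is fine; the NaN is introduced earlier, in the loss/matcher computation.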

kirillkoncha avatar Sep 28 '22 22:09 kirillkoncha

Hi @kirillkoncha, sorry for the late reply. Maybe you can fix it by:

iam_prob = iam_prob.view(B, N, -1)
normalizer = iam_prob.sum(-1).float().clamp(min=1e-4)
iam_prob = iam_prob / normalizer[:, :, None]

In most cases, these NaN errors are caused by fp16 training.
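To see why the clamp matters, here is a minimal NumPy sketch (standing in for the actual fp16 IAM tensors in SparseInst, shapes chosen for illustration): if every activation in a group underflows to zero, the unclamped normalizer is zero and the division produces NaN, which later poisons the matching cost matrix.

```python
import numpy as np

# B=1 image, N=2 instance groups, 4 spatial positions, all underflowed to 0
iam_prob = np.zeros((1, 2, 4), dtype=np.float16)

with np.errstate(invalid="ignore"):
    # unclamped: 0 / 0 -> NaN
    naive = iam_prob / iam_prob.sum(-1, keepdims=True)

# clamped (as in the suggested fix): cast to float32 and floor at 1e-4
normalizer = np.clip(iam_prob.sum(-1, keepdims=True).astype(np.float32), 1e-4, None)
clamped = iam_prob / normalizer

print(np.isnan(naive).any(), np.isnan(clamped).any())  # True False
```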

wondervictor avatar Oct 27 '22 02:10 wondervictor

> Hi @kirillkoncha, sorry for the late reply. Maybe you can fix it by:
>
> iam_prob = iam_prob.view(B, N, -1)
> normalizer = iam_prob.sum(-1).float().clamp(min=1e-4)
> iam_prob = iam_prob / normalizer[:, :, None]
>
> Mostly, the NaN errors occur due to the fp16 training.

still error

MrL-CV avatar Nov 18 '22 08:11 MrL-CV

Hi all, if you run into this problem, you could try this:

iam_prob = F.softmax(torch.logsigmoid(iam.view(B, N, -1)), dim=-1)

which avoids the numerical-instability problem.
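As a rough illustration of why this helps (a NumPy sketch, assuming the view flattens the spatial dimension as in the original iam.view(B, N, -1)): sigmoid followed by sum-normalization can collapse to a 0/0 division when all logits are strongly negative, while a softmax over log-sigmoid scores stays finite, because softmax with max-subtraction normalizes safely in log space.

```python
import numpy as np

def softmax(x, axis=-1):
    # standard max-subtraction trick for numerical stability
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# B=1, N=2, 4 spatial positions, all logits strongly negative
logits = np.full((1, 2, 4), -200.0, dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    sig = 1.0 / (1.0 + np.exp(-logits))        # underflows to exactly 0 in fp32
    naive = sig / sig.sum(-1, keepdims=True)   # 0 / 0 -> NaN

# log(sigmoid(x)) computed stably as -logaddexp(0, -x), then softmax
stable = softmax(-np.logaddexp(0.0, -logits), axis=-1)

print(np.isnan(naive).any(), np.isnan(stable).any())  # True False
```

With all-equal logits the stable version simply returns a uniform distribution over the spatial positions instead of NaN.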

wondervictor avatar Jan 03 '23 06:01 wondervictor