Why is the data association different between training and inference?
I found that the data association methods used during training and inference are different. Why is that?
Training and inference are indeed not the same, but this is very common for MOT methods. Are you referring to a specific aspect?
Thanks for your reply! Section 3.2 of your paper mentions that 'The simultaneous Transformer decoding of object and track queries allows our model to perform detection and tracking in a unified way'. I do not understand the 'unified way' mentioned in the paper. Does it mean that detection and tracking are performed by the same model?
Yes, the same model performs detection and tracking all within the attention layers of the decoder. Hence, both tasks are jointly end-to-end trainable.
@timmeinhardt Thanks for your reply! I tried to train the model to reproduce your results, but metrics such as MOTA and IDF1 are about one percentage point lower than the ones you report. The parameters used in my experiment are listed below. Could you help me see what is wrong?
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 2
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = '/media/valca3090/back-up/QX/trackformer/data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 40
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 10
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = '/media/valca3090/back-up/QX/trackformer/data/MOT17'
mot_path_val = '/media/valca3090/back-up/QX/trackformer/data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = '/media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = '//media/valca3090/back-up/QX/trackformer/models/trackformer_models_v1/crowdhuman_deformable_multi_frame/checkpoint_epoch_80.pth'
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = 'mot17_train_coco'
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8090
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 1
img_transform:
max_size = 1333
val_width = 800
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='/media/valca3090/back-up/QX/trackformer/data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=50, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=6, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='/media/valca3090/back-up/QX/trackformer/data/MOT17', mot_path_val='/media/valca3090/back-up/QX/trackformer/data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='/media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='//media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame/checkpoint.pth', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=5, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split='mot17_train_coco', two_stage=False, val_interval=5, val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8090, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=1)
Not using distributed mode
Can u send the train and eval commands that you are executing?
Train command I used:
python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame
Eval command I used:
python src/track.py with reid
These look fine. And what results are you trying to obtain? In this issue you seem to be trying to reproduce the training set numbers https://github.com/timmeinhardt/trackformer/issues/46#issuecomment-1229689775. The code is non-deterministic, so some noise w.r.t. the final score is to be expected. Have you tried a test set submission?
Hi, I want to reproduce the results mentioned in the README.md.
I ran python src/track.py with reid and saved the results by modifying output_dir in the config file track.yaml, then I submitted the output to https://motchallenge.net, but it does not reproduce the results on the test set. Is there a problem with the procedure I described above?
Did you retrain the model or use our pretrained model file? And what MOTA score did you obtain? Something around 73?
I retrained the model following the commands in TRAIN.md. The MOTA scores are as follows:
Public detection: train 64.7% test 61.23%
Private detection: train 73.4% test 70.48%
There is some noise w.r.t. the final scores, but it should not be 4 points as it is for your private detection results. Did you train separate models for public and private detection? The latter should load the pretrained CrowdHuman model and then run a joint training on MOT and CrowdHuman. Did this happen? To rule out evaluation errors, you could evaluate our provided pretrained models and check whether you at least reproduce those results.
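For reference, the private-detection training should roughly look like the following. This is only a sketch: the named configs are the ones from your own command, and the checkpoint path is taken from the resume value in your config dump, so it may differ in your setup.

python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    resume=models/crowdhuman_deformable_multi_frame/checkpoint_epoch_80.pth \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame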
I did everything as you said, and nothing went wrong when evaluating your pretrained models. I would like to know if it is related to the GPU; I trained only on a single 3090 Ti.
Did you submit the results from our pretrained models and obtain the same numbers? Just to be sure it is not related to the evaluation part of the pipeline.
How many GB of memory does your GPU have? In our trainings, 24 GB was not enough. Did you adjust any parameters to fit the training on a 24 GB GPU?
Yes, I submitted the results from your pretrained models and got the same numbers. My GPU has 24 GB, and due to the limited memory I tried two methods to fit the training on it. One was reducing the image max_size from 1333 to 1080, the other was reducing the batch_size from 2 to 1. But neither performed as well as your pretrained models on private detection. Will these changes make a big difference?
These changes can definitely make a difference, and you should mention any changes you made when reporting an issue. I would not touch the max_size parameter to fit the training on your GPU but rather work with a reduced batch size. However, this means the learning rates and number of iterations might also need adjustment. You could try running with batch_size=1 and half the learning rates, e.g. as sketched below.
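Starting from the defaults in your config dump (lr=0.0002, lr_backbone=2e-05, lr_track=0.0001), a halved-learning-rate run could look roughly like this. Treat the exact values as a starting point rather than something we have verified; the overrides are passed as regular Sacred config updates after `with`.

python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    batch_size=1 \
    lr=0.0001 \
    lr_backbone=1e-05 \
    lr_track=5e-05 \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame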
Thanks for your reply, and I am sorry for the trouble caused by my unclear description. I will try the method you mentioned. Thank you again!