Why is the data association different between training and inference?
I found that the data association methods used during training and inference are different. Why is that?
Training and inference are indeed not the same, but this is very common for MOT methods. Are you referring to a specific aspect?
Thanks for your reply! Section 3.2 of your paper mentions that 'The simultaneous Transformer decoding of object and track queries allows our model to perform detection and tracking in a unified way'. I do not understand the 'unified way' mentioned in the paper. Does it mean that detection and tracking are performed by the same model?
Yes, the same model performs detection and tracking all within the attention layers of the decoder. Hence, both tasks are jointly end-to-end trainable.
@timmeinhardt Thanks for your reply! I tried to train the model to reproduce your results, but metrics such as MOTA and IDF1 are about one percentage point lower than the ones you report. The parameters used in my experiment are listed below. Could you help me see what is wrong?
Configuration (modified, added, typechanged, doc):
aux_loss = True
backbone = 'resnet50'
batch_size = 2
bbox_loss_coef = 5.0
clip_max_norm = 0.1
cls_loss_coef = 2.0
coco_and_crowdhuman_prev_frame_rnd_augs = 0.2
coco_min_num_objects = 0
coco_panoptic_path = None
coco_path = 'data/coco_2017'
coco_person_train_split = None
crowdhuman_path = '/media/valca3090/back-up/QX/trackformer/data/CrowdHuman'
crowdhuman_train_split = 'train_val'
dataset = 'mot_crowdhuman'
debug = False
dec_layers = 6
dec_n_points = 4
deformable = True
device = 'cuda'
dice_loss_coef = 1.0
dilation = False
dim_feedforward = 1024
dist_url = 'env://'
dropout = 0.1
enc_layers = 6
enc_n_points = 4
eos_coef = 0.1
epochs = 40
eval_only = False
eval_train = False
focal_alpha = 0.25
focal_gamma = 2
focal_loss = True
freeze_detr = False
giou_loss_coef = 2
hidden_dim = 288
load_mask_head_from_model = None
lr = 0.0002
lr_backbone = 2e-05
lr_backbone_names = ['backbone.0']
lr_drop = 10
lr_linear_proj_mult = 0.1
lr_linear_proj_names = ['reference_points', 'sampling_offsets']
lr_track = 0.0001
mask_loss_coef = 1.0
masks = False
merge_frame_features = False
mot_path_train = '/media/valca3090/back-up/QX/trackformer/data/MOT17'
mot_path_val = '/media/valca3090/back-up/QX/trackformer/data/MOT17'
multi_frame_attention = True
multi_frame_attention_separate_encoder = True
multi_frame_encoding = True
nheads = 8
no_vis = False
num_feature_levels = 4
num_queries = 500
num_workers = 2
output_dir = '/media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame'
overflow_boxes = True
overwrite_lr_scheduler = False
overwrite_lrs = False
position_embedding = 'sine'
pre_norm = False
resume = '//media/valca3090/back-up/QX/trackformer/models/trackformer_models_v1/crowdhuman_deformable_multi_frame/checkpoint_epoch_80.pth'
resume_optim = False
resume_shift_neuron = False
resume_vis = False
save_model_interval = 5
seed = 42
set_cost_bbox = 5.0
set_cost_class = 2.0
set_cost_giou = 2.0
start_epoch = 1
track_attention = False
track_backprop_prev_frame = False
track_prev_frame_range = 5
track_prev_frame_rnd_augs = 0.01
track_prev_prev_frame = False
track_query_false_negative_prob = 0.4
track_query_false_positive_eos_weight = True
track_query_false_positive_prob = 0.1
tracking = True
tracking_eval = True
train_split = 'mot17_train_coco'
two_stage = False
val_interval = 5
val_split = 'mot17_train_cross_val_frame_0_5_to_1_0_coco'
vis_and_log_interval = 50
vis_port = 8090
vis_server = ''
weight_decay = 0.0001
with_box_refine = True
world_size = 1
img_transform:
max_size = 1333
val_width = 800
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5.0, clip_max_norm=0.1, cls_loss_coef=2.0, coco_and_crowdhuman_prev_frame_rnd_augs=0.2, coco_min_num_objects=0, coco_panoptic_path=None, coco_path='data/coco_2017', coco_person_train_split=None, crowdhuman_path='/media/valca3090/back-up/QX/trackformer/data/CrowdHuman', crowdhuman_train_split='train_val', dataset='mot_crowdhuman', debug=False, dec_layers=6, dec_n_points=4, deformable=True, device='cuda', dice_loss_coef=1.0, dilation=False, dim_feedforward=1024, dist_url='env://', dropout=0.1, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=50, eval_only=False, eval_train=False, focal_alpha=0.25, focal_gamma=2, focal_loss=True, freeze_detr=False, giou_loss_coef=2, hidden_dim=288, img_transform=Namespace(max_size=1333, val_width=800), load_mask_head_from_model=None, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=6, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_track=0.0001, mask_loss_coef=1.0, masks=False, merge_frame_features=False, mot_path_train='/media/valca3090/back-up/QX/trackformer/data/MOT17', mot_path_val='/media/valca3090/back-up/QX/trackformer/data/MOT17', multi_frame_attention=True, multi_frame_attention_separate_encoder=True, multi_frame_encoding=True, nheads=8, no_vis=False, num_feature_levels=4, num_queries=500, num_workers=2, output_dir='/media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame', overflow_boxes=True, overwrite_lr_scheduler=False, overwrite_lrs=False, position_embedding='sine', pre_norm=False, resume='//media/valca3090/back-up/QX/trackformer/models/mot17_crowdhuman_deformable_multi_frame/checkpoint.pth', resume_optim=False, resume_shift_neuron=False, resume_vis=False, save_model_interval=5, seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, start_epoch=5, track_attention=False, track_backprop_prev_frame=False, track_prev_frame_range=5, track_prev_frame_rnd_augs=0.01, track_prev_prev_frame=False, track_query_false_negative_prob=0.4, track_query_false_positive_eos_weight=True, track_query_false_positive_prob=0.1, tracking=True, tracking_eval=True, train_split='mot17_train_coco', two_stage=False, val_interval=5, val_split='mot17_train_cross_val_frame_0_5_to_1_0_coco', vis_and_log_interval=50, vis_port=8090, vis_server='', weight_decay=0.0001, with_box_refine=True, world_size=1)
Not using distributed mode
Can u send the train and eval commands that you are executing?
Train command I used:
python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame
Eval command I used:
python src/track.py with reid
These look fine. And what results are you trying to obtain? In this issue you seem to be trying to reproduce the training set numbers https://github.com/timmeinhardt/trackformer/issues/46#issuecomment-1229689775. The code is non-deterministic, so some noise w.r.t. the final score is to be expected. Have you tried a test set submission?
Hi, I want to reproduce the results mentioned in the README.md.
I ran python src/track.py with reid and saved the results by modifying output_dir in the config file track.yaml, then I submitted the output to https://motchallenge.net, but it does not reproduce the results on the test set. Is there a problem with the procedure I described above?
Did you retrain the model or use our pretrained model file? And what MOTA score did you obtain? Something around 73?
I retrained the model following the commands in TRAIN.md. The MOTA scores are as follows:
Public detection: train 64.7% test 61.23%
Private detection: train 73.4% test 70.48%
There is some noise w.r.t. the final scores, but it should not be 4 points as it is for your private detection results. Did you train separate models for public and private detection? The latter should load the pretrained CrowdHuman model and then run a joint training on MOT and CrowdHuman. Did this happen? To rule out evaluation errors, you could evaluate our provided pretrained models and check whether you at least reproduce those results.
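For reference, the private-detection training should roughly look like the following. This is only a sketch: the named configs are the ones from your own command, and the checkpoint path is taken from the resume value in your config dump, so it may differ in your setup.

python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    resume=models/crowdhuman_deformable_multi_frame/checkpoint_epoch_80.pth \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame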
I did everything as you said, and nothing went wrong when evaluating your pretrained models. I would like to know if it is related to the GPU; I trained only on a single 3090 Ti.
Did you submit the results from our pretrained models and obtain the same numbers? Just to be sure it is not related to the evaluation part of the pipeline.
How many GB of memory does your GPU have? In our trainings, 24 GB was not enough. Did you adjust any parameters to fit the training on a 24 GB GPU?
Yes, I submitted the results from your pretrained models and got the same numbers. My GPU has 24 GB, and due to the limited memory I tried two methods to fit the training on it. One was reducing the image max_size from 1333 to 1080, the other was reducing the batch_size from 2 to 1. But neither performed as well as your pretrained models on private detection. Will these changes make a big difference?
These changes can definitely make a difference, and you should mention any changes you made when reporting an issue. I would not touch the max_size parameter to fit the training on your GPU but rather work with a reduced batch size. However, this means the learning rates and number of iterations might also need adjustment. You could try running with batch_size=1 and half the learning rates, e.g. as sketched below.
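Starting from the defaults in your config dump (lr=0.0002, lr_backbone=2e-05, lr_track=0.0001), a halved-learning-rate run could look roughly like this. Treat the exact values as a starting point rather than something we have verified; the overrides are passed as regular Sacred config updates after `with`.

python src/train.py with \
    mot17_crowdhuman \
    deformable \
    multi_frame \
    tracking \
    batch_size=1 \
    lr=0.0001 \
    lr_backbone=1e-05 \
    lr_track=5e-05 \
    output_dir=models/mot17_crowdhuman_deformable_multi_frame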
Thanks for your reply, and I am sorry for the trouble caused by my unclear description. I will try the method you mentioned. Thank you again!