
Whether to support distributed training

Open dqshuai opened this issue 3 years ago • 17 comments

Hello, thanks for your project. I want to know whether distributed training is supported, and what I should do to enable it.

dqshuai avatar Apr 09 '21 06:04 dqshuai

Hi, we didn't try training with multiple GPUs, but MMDetection supports distributed training; please refer to https://github.com/daodaofr/AlignPS/blob/master/tools/dist_train.sh

daodaofr avatar Apr 09 '21 06:04 daodaofr

Thanks for your reply. I am now trying distributed training with the command "./tools/dist_train.sh configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py 8 --launcher pytorch --no-validate". It trains normally, but I don't know whether it will affect the final performance. Normally, distributed training should not hurt performance, is that correct?

dqshuai avatar Apr 09 '21 06:04 dqshuai

Normally you can still get fair performance; you may need to adjust the batch size and learning rate to get the best results.

daodaofr avatar Apr 09 '21 07:04 daodaofr
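
(For reference, a minimal sketch of the kind of adjustment this suggests, assuming an MMDetection-style config where data.samples_per_gpu and optimizer.lr are the relevant fields; the values below are illustrative, not settings verified in this thread.)

# Illustrative multi-GPU config fragment (MMDetection-style); values are placeholders.
num_gpus = 8           # GPUs passed to dist_train.sh
samples_per_gpu = 4    # per-GPU batch size
base_lr = 0.001        # single-GPU learning rate from the original config

data = dict(samples_per_gpu=samples_per_gpu, workers_per_gpu=2)

# Common heuristic: scale the learning rate with the effective batch size
# (the "linear scaling rule"); this is a starting point, not a confirmed recipe.
optimizer = dict(type='SGD', lr=base_lr * num_gpus, momentum=0.9)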

Hi, I just finished training with multiple GPUs on the PRW dataset. Compared to the results in the paper, mAP is 2% lower but rank-1 is 1% higher. When I checked the config, I found that the bbox_head is 'FCOSReidHeadFocalOimSub', without the triplet loss: https://github.com/daodaofr/AlignPS/blob/c20cf329b2934a8693e2064435d3e3f65c496095/configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py#L11 I want to know whether the difference in results is related to this; I did not find an ablation on this in your paper. Thanks!

dqshuai avatar Apr 09 '21 12:04 dqshuai

Thanks for your results; I think they are normal. In my experience, the triplet loss has only a very slight influence on PRW, less than 1%. Different environments (mmcv, pytorch, cuda) can also cause a 1%-2% performance difference. PRW is smaller than CUHK-SYSU, so it is normal to see some fluctuation.

daodaofr avatar Apr 09 '21 13:04 daodaofr

When I train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and R1 is 89.79 without adjusting any parameters. After that, I tried the following: (1) adjusting lr from 0.001 to 0.01; (2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN was not being synced. But these measures did not work. Can you give me some suggestions? Thanks! My environment: mmcv-full==1.1.5, pytorch==1.5.1, cuda==10.2, but I don't think the environment can cause a 4% mAP difference. :)

dqshuai avatar Apr 13 '21 12:04 dqshuai
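
(For context, the SyncBatchNorm conversion mentioned above follows a generic PyTorch pattern, sketched below; it is not the exact code path of this repo. In MMDetection the same effect is usually obtained by setting norm_cfg=dict(type='SyncBN') in the config.)

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model, local_rank):
    # Convert every BatchNorm layer to SyncBatchNorm so batch statistics are
    # synchronized across ranks, then wrap the model with DDP. Assumes the
    # process group has already been initialized by the launcher.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])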

I am sorry, but I haven't tried distributed training. So I cannot give practical suggestions on that. If you want to reproduce the results, please try to use a single GPU.

daodaofr avatar Apr 13 '21 13:04 daodaofr

Thanks for your reply. I received a system email in which you suggested using all_gather to update the lookup_table with global features. The example you provided has some problems due to the inconsistent feature size across ranks. I made some modifications and then adjusted the learning rate; the current mAP can reach 92.91. Why can't I see that reply in the issue, and are there any other details I need to pay attention to in order to get a higher mAP?

dqshuai avatar Apr 18 '21 02:04 dqshuai

I also noticed the inconsistent feature size issue, which makes the network stop training, so I deleted the reply. It would be nice if you could give an example of your modified code to help others with distributed training. Maybe more epochs are needed with multiple GPUs.

daodaofr avatar Apr 18 '21 06:04 daodaofr

My current implementation is a bit ugly. :)

import torch
import torch.distributed as dist
from torch.autograd import Function

# NOTE: get_dist_info() is assumed to be a project utility returning
# (rank, world_size, is_dist); it is not the two-value mmcv.runner.get_dist_info().

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return x
    if not save_memory:
        # all gather features in parallel
        # cost more GPU memory but less time
        # x = x.cuda(gpu)
        x_gather = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(x_gather, x, async_op=False)
#         x_gather = torch.cat(x_gather, dim=0)
    else:
        # broadcast features in sequence
        # cost more time but less GPU memory
        container = torch.empty_like(x).cuda(gpu)
        x_gather = []
        for k in range(world_size):
            container.data.copy_(x)
            print("gathering features from rank no.{}".format(k))
            dist.broadcast(container, k)
            x_gather.append(container.cpu())
#         x_gather = torch.cat(x_gather, dim=0)
        # return cpu tensor
    return x_gather


def undefined_l_gather(features, pid_labels):
    # Work around the varying number of features per rank: pad everything to a
    # fixed size before all_gather, then crop each rank's chunk back to its
    # true length afterwards.
    resized_num = 10000
    pos_num = min(features.size(0), resized_num)
    if features.size(0) > resized_num:
        print(f'{features.size(0)} out of {resized_num}')
    resized_features = torch.empty((resized_num, features.size(1))).to(features.device)
    resized_features[:pos_num, :] = features[:pos_num, :]
    resized_pid_labels = torch.empty((resized_num,)).to(pid_labels.device)
    resized_pid_labels[:pos_num] = pid_labels[:pos_num]
    pos_num = torch.tensor([pos_num]).to(features.device)
    # Gather the true lengths together with the padded tensors.
    all_pos_num = all_gather_tensor(pos_num)
    all_features = all_gather_tensor(resized_features)
    all_pid_labels = all_gather_tensor(resized_pid_labels)
    gather_features = []
    gather_pid_labels = []
    for index, p_num in enumerate(all_pos_num):
        gather_features.append(all_features[index][:p_num, :])
        gather_pid_labels.append(all_pid_labels[index][:p_num])
    gather_features = torch.cat(gather_features, dim=0)
    gather_pid_labels = torch.cat(gather_pid_labels, dim=0)
    return gather_features, gather_pid_labels

class LabeledMatching(Function):
    @staticmethod
    def forward(ctx, features, pid_labels, lookup_table, momentum=0.5):
        # The lookup_table can't be saved with ctx.save_for_backward(), as we would
        # modify the variable which has the same memory address in backward()
#         ctx.save_for_backward(features, pid_labels)
        gather_features, gather_pid_labels = undefined_l_gather(features, pid_labels)
        ctx.save_for_backward(gather_features, gather_pid_labels)
        ctx.lookup_table = lookup_table
        ctx.momentum = momentum
        scores = features.mm(lookup_table.t())
        #print(features, lookup_table, scores)
        pos_feats = lookup_table.clone().detach()
        pos_idx = pid_labels > 0
        pos_pids = pid_labels[pos_idx]
        pos_feats = pos_feats[pos_pids]
        #pos_feats.require_grad = False
        return scores, pos_feats, pos_pids

    @staticmethod
    def backward(ctx, grad_output, grad_feat, grad_pids):
        features, pid_labels = ctx.saved_tensors
        pid_labels = pid_labels.long()
        lookup_table = ctx.lookup_table
        momentum = ctx.momentum
        grad_feats = None
        if ctx.needs_input_grad[0]:
            grad_feats = grad_output.mm(lookup_table)
        # Update lookup table, but not by standard backpropagation with gradients
        for indx, label in enumerate(pid_labels):
            if label >= 0:
                lookup_table[label] = (
                    momentum * lookup_table[label] + (1 - momentum) * features[indx]
                )
                #lookup_table[label] /= lookup_table[label].norm()
        return grad_feats, None, None, None

dqshuai avatar Apr 18 '21 06:04 dqshuai
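
(For readers adapting this: a custom autograd Function such as LabeledMatching is invoked through .apply, the standard PyTorch pattern. The sketch below uses hypothetical shapes and assumes the functions above, including the project's get_dist_info utility, are in scope.)

import torch

N, D, L = 32, 256, 500                   # detections, feature dim, identities (illustrative)
features = torch.randn(N, D, requires_grad=True)
pid_labels = torch.randint(-1, L, (N,))  # labels <= 0 are treated as non-positives in forward()
lookup_table = torch.randn(L, D)

scores, pos_feats, pos_pids = LabeledMatching.apply(
    features, pid_labels, lookup_table, 0.5)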

Great! Thanks :)

daodaofr avatar Apr 18 '21 07:04 daodaofr

I think all_gather_tensor should return a list if is_dist is false:

@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return [x]
    # remaining code here...

anDoer avatar Apr 22 '21 13:04 anDoer
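
(A quick single-process sanity check of that point, assuming the all_gather_tensor above, with this fix applied, is in scope: without an initialized process group it should return a one-element list, so the loop in undefined_l_gather still works.)

import torch

x = torch.randn(6, 256)
gathered = all_gather_tensor(x)   # no process group initialized -> is_dist is False
assert isinstance(gathered, list) and len(gathered) == 1
assert torch.equal(gathered[0], x)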

@dqshuai Thanks for sharing your modified dist_training code. I have several questions about the two points you mentioned above. How many GPUs did you use, and what was the batch size on each GPU when you got 92.91 mAP? What is the empirical ratio between the lr used for single-GPU and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

hh23333 avatar Apr 26 '21 13:04 hh23333

(1) I used 8 GPUs with a batch size of 4 per GPU. When I set lr=0.05, I got 92.91 mAP. At first I thought the empirical ratio would be about single_gpu_lr (0.001) × num_of_gpus, but I did not get a better result with an lr of 0.008 or 0.01. (2) Using sync_batchnorm reduced the result, and I don't know why. If you have any other findings, please share them with me. I haven't fully reproduced the results of the paper with multiple GPUs. Thanks!

dqshuai avatar Apr 26 '21 14:04 dqshuai
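
(Purely illustrative arithmetic behind the scaling discussed here: the linear scaling rule predicts 0.001 x 8 = 0.008, while the value reported to work in this thread, 0.05, is roughly 6x larger than that prediction.)

single_gpu_lr = 0.001
num_gpus = 8
linear_scaled_lr = single_gpu_lr * num_gpus   # 0.008 -- did not help here
reported_best_lr = 0.05                       # what worked best in this thread
print(reported_best_lr / linear_scaled_lr)    # ~6.25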

Got it, Thanks!

hh23333 avatar Apr 27 '21 03:04 hh23333

Hi, I tried the distributed implementation of @dqshuai, but the performance got worse. I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed tools. Was this implemented by you, @daodaofr? Can I use it to fix the feature size inconsistency across ranks?

qixiong-wang avatar Aug 13 '21 15:08 qixiong-wang

That was just an attempt of mine; it didn't work out.

daodaofr avatar Aug 16 '21 06:08 daodaofr