
[Bug] Same input data, same checkpoint, but different results using single/multiple GPU(s)

Open JunweiZheng93 opened this issue 2 years ago • 10 comments

Prerequisite

  • [x] I have searched Issues and Discussions but cannot get the expected help.
  • [x] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1,2,3', 'NVIDIA GeForce RTX 2080 Ti'), ('CUDA_HOME', None), ('GCC', 'gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0'), ('PyTorch', '2.0.0'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2023.0-Product Build 20221128 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.2')])

Reproduces the problem - code sample

import torch
from mmengine.evaluator import BaseMetric
from mmengine.registry import METRICS


@METRICS.register_module()
class Accuracy(BaseMetric):
    def __init__(self, mode='val'):
        super().__init__()
        self.mode = mode

    def process(self, inputs, data_samples):
        for data_sample in data_samples:
            result = dict(gt_cls_label=data_sample['gt_cls_label'], pred_cls_label=data_sample['pred_cls_label'])
            self.results.append(result)  # self.results is the 'results' list passed to compute_metrics

    def compute_metrics(self, results: list[dict]) -> dict:
        # this method is used for both predict and tensor mode
        gt_cls_labels = torch.tensor([result['gt_cls_label'] for result in results])
        pred_cls_labels = torch.tensor([result['pred_cls_label'] for result in results])
        acc = torch.sum(pred_cls_labels == gt_cls_labels) / gt_cls_labels.shape[0]
        if self.mode == 'val':
            return dict(val_acc=acc)
        elif self.mode == 'test':
            return dict(test_acc=acc)
        else:
            raise RuntimeError(f'Invalid mode "{self.mode}". Only "val" and "test" modes are supported.')

Reproduces the problem - command or script

# single GPU:  
python train.py cfg_file
python test.py cfg_file ckpt_file

# multiple GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py cfg_file
torchrun --standalone --nnodes=1 --nproc_per_node=2 test.py cfg_file ckpt_file

Reproduces the problem - error message

# single GPU training; single or multiple GPUs test
val_acc: 0.9386  (this is shown in the training process)
test_acc: 0.9386  (this is shown in the test process)

# multiple GPUs training; single or multiple GPUs test
val_acc: 0.9386  (this is shown in the training process)
test_acc: 0.9234  (this is shown in the test process)

Additional information

My val_dataloader and test_dataloader are exactly the same. When training with a single GPU, the val_acc shown during training matches the test_acc computed from the corresponding checkpoint in the test process. However, when training with multiple GPUs, the two values differ, even though they should be identical.
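
For context on where multi-GPU evaluation can diverge from single-GPU evaluation: in distributed runs each rank accumulates its own self.results, and those per-rank lists are gathered before compute_metrics runs. A simplified sketch of that gathering step (not the exact mmengine implementation, and assuming torch.distributed is already initialized):

import torch.distributed as dist

def gather_results(local_results: list, dataset_size: int) -> list:
    """Simplified sketch of merging per-rank metric results.

    Distributed samplers may pad the dataset so that every rank processes the
    same number of samples, so the merged list is truncated to the real dataset
    size. If the ranks see overlapping or differently ordered data, the merged
    list (and therefore the metric) can differ from a single-GPU run.
    """
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_results)  # collect every rank's result list
    merged = [item for part in gathered for item in part]
    return merged[:dataset_size]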

JunweiZheng93 avatar Apr 09 '23 08:04 JunweiZheng93

Thanks for your feedback. I guess this is mainly caused by the fact that we haven't synchronized the model buffers before saving the checkpoint and evaluating the model. We will fix it ASAP (of course, it would be helpful if you could create a PR to fix this 😆).
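
A possible workaround in the meantime is to register mmengine's SyncBuffersHook, which synchronizes model buffers (e.g. BN running statistics) across ranks during training; a config sketch (please double-check the hook is available in your mmengine version):

# in the config file
custom_hooks = [
    dict(type='SyncBuffersHook'),  # sync model buffers across ranks
]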

HAOCHENYE avatar Apr 10 '23 16:04 HAOCHENYE

I have run into a similar problem with MMEngine.

During training with multiple GPUs, the metrics reported during validation and the metrics obtained when testing the same test set with the same checkpoint are different, but there is no such problem with single-GPU training.

Strangely enough, I have already synced the buffers via SyncBuffersHook, but I still encounter the problem described above. How can I solve it?

wuwenbin970731 avatar Aug 08 '23 08:08 wuwenbin970731

Maybe you could try to set:

model_wrapper_cfg = dict(
    type='...',
    broadcast_buffers=True,
)

in the config file. If that works, we'll check the implementation of SyncBuffersHook.
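
For reference, a fuller version of the snippet above might look like the following, assuming the standard MMDistributedDataParallel wrapper (adapt the type to whichever wrapper your downstream repo actually uses):

# in the config file
model_wrapper_cfg = dict(
    type='MMDistributedDataParallel',  # the default DDP wrapper in mmengine
    broadcast_buffers=True,            # broadcast buffers from rank 0 on every forward pass
)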

HAOCHENYE avatar Aug 08 '23 09:08 HAOCHENYE

After modifying the relevant settings (e.g. broadcast_buffers=True, sync_bn, SyncBuffersHook), I saved the corresponding checkpoints on all ranks and compared them to the checkpoint used during testing.

I found that their parameters are identical.
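
For anyone repeating this check, a minimal sketch of the comparison; the checkpoint paths are placeholders, and the weights are assumed to live under the usual 'state_dict' key:

import torch

ckpt_a = torch.load('ckpt_saved_on_rank0.pth', map_location='cpu')['state_dict']
ckpt_b = torch.load('ckpt_used_for_test.pth', map_location='cpu')['state_dict']

assert ckpt_a.keys() == ckpt_b.keys(), 'checkpoints contain different parameter names'
for name in ckpt_a:
    if not torch.equal(ckpt_a[name], ckpt_b[name]):
        print(f'mismatch in {name}')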

However, I still get inconsistent results when testing. Is there any implicit difference between calling val_loop during training and calling runner.val() directly?

wuwenbin970731 avatar Aug 08 '23 11:08 wuwenbin970731

After configuring these parameters, did you retrain and save the new checkpoint?

HAOCHENYE avatar Aug 08 '23 13:08 HAOCHENYE

Yes, I did retrain and save the new checkpoint.

One problem I found is that I tried to save the corresponding predictions in the process function of my metric, but not all of the predictions were saved.

I also checked the number of run_iter calls in ValLoop, and after adding up the iterations across all ranks, the total does match the size of the full test set.

This confuses me. I think the metrics differ because not all test samples are actually used during validation, and there is randomness in which samples are used, which leads to inconsistent validation results.

Is this scenario possible? And what are the possible reasons?

wuwenbin970731 avatar Aug 08 '23 13:08 wuwenbin970731

I found the reason for the inconsistent test results: although the number of samples tested on each rank is correct, there are overlapping samples between ranks, and the overlap is random, which leads to inconsistent results.

But I haven't figured out why yet; I have already set shuffle=False in the sampler.
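
One way to confirm the overlap is to record an identifier for every sample each rank sees (for example inside the metric's process method) and compare the lists across ranks after the loop; a rough sketch, assuming torch.distributed is initialized and each sample carries some unique id:

import torch.distributed as dist

def report_rank_overlap(local_sample_ids: list) -> None:
    """Gather the sample ids seen on each rank and count duplicates across ranks."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_sample_ids)
    if dist.get_rank() == 0:
        all_ids = [i for part in gathered for i in part]
        # note: samplers that round up may legitimately duplicate a few samples as padding
        print(f'{len(all_ids) - len(set(all_ids))} sample ids appear on more than one rank')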

wuwenbin970731 avatar Aug 08 '23 14:08 wuwenbin970731

It is strange that there would be overlapping data between ranks if you are using the DefaultSampler provided by mmengine.
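
For context, DefaultSampler splits the dataset in the same spirit as PyTorch's DistributedSampler; a simplified sketch of that behavior with shuffle=False (not the exact implementation):

import math

def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    """Simplified sketch of the round-robin split used by distributed samplers."""
    indices = list(range(num_samples))
    # pad so that every rank gets the same number of samples
    total_size = math.ceil(num_samples / world_size) * world_size
    indices += indices[:total_size - num_samples]
    # each rank takes a disjoint stride of index positions
    return indices[rank:total_size:world_size]

The split is over index positions, so disjoint index shards only map to disjoint samples if every rank builds the underlying element list in exactly the same order.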

HAOCHENYE avatar Aug 09 '23 04:08 HAOCHENYE

My validation dataset is derived from the PyTorch Dataset class, not from BaseSegDataset. Could some discrepancy there cause the problems above? I tested with the ADE dataset provided by mmseg, and the issue does not occur there.

wuwenbin970731 avatar Aug 09 '23 07:08 wuwenbin970731

I eventually pinpointed the problem: my test set's elements are read in a particular way rather than from a local path, and the order of the element list is random. When the test set was instantiated on different ranks, the order of its elements differed across ranks, which caused the problem above. I now sort this element list according to a fixed rule, and the problem is solved.
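
For others hitting the same issue, a minimal sketch of the fix inside a custom dataset, assuming a hypothetical backend with a list_remote_samples() method that returns sample keys in nondeterministic order:

from torch.utils.data import Dataset

class RemoteDataset(Dataset):
    """Hypothetical dataset whose sample list comes from a remote backend."""

    def __init__(self, backend):
        self.backend = backend
        # the backend may return keys in arbitrary order on each rank, so sort
        # them to guarantee that every rank builds an identical index list
        self.sample_keys = sorted(backend.list_remote_samples())

    def __len__(self):
        return len(self.sample_keys)

    def __getitem__(self, idx):
        return self.backend.load(self.sample_keys[idx])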

Thank you very much for your patient answer.

wuwenbin970731 avatar Aug 09 '23 08:08 wuwenbin970731