
[Bug] DeepSpeedStrategy load_checkpoint `strict`&`load_module_only`

Open mypydl opened this issue 1 year ago • 2 comments

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version(https://github.com/open-mmlab/mmengine).

Environment

OrderedDict([('sys.platform', 'linux'),
             ('Python', '3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0]'),
             ('CUDA available', True),
             ('numpy_random_seed', 2147483648),
             ('GPU 0,1,2,3,4,5,6,7', 'NVIDIA GeForce RTX 2080 Ti'),
             ('CUDA_HOME', '/usr/local/cuda'),
             ('NVCC', 'Cuda compilation tools, release 11.8, V11.8.89'),
             ('GCC', 'gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0'),
             ('PyTorch', '2.0.0'),
             ('PyTorch compiling details',
              'PyTorch built with:\n'
              '  - GCC 9.3\n'
              '  - C++ Version: 201703\n'
              '  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product '
              'Build 20230303 for Intel(R) 64 architecture applications\n'
              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '
              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
              '  - LAPACK is enabled (usually provided by MKL)\n'
              '  - NNPACK is enabled\n'
              '  - CPU capability usage: AVX2\n'
              '  - CUDA Runtime 11.8\n'
              '  - NVCC architecture flags: '
              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
              '  - CuDNN 8.7\n'
              '  - Magma 2.6.1\n'
              '  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, '
              'CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, '
              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= '
              '-D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated '
              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG '
              '-DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK '
              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra '
              '-Werror=return-type -Werror=non-virtual-dtor '
              '-Werror=bool-operation -Wnarrowing '
              '-Wno-missing-field-initializers -Wno-type-limits '
              '-Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs '
              '-Wno-unused-parameter -Wno-unused-function -Wno-unused-result '
              '-Wno-strict-overflow -Wno-strict-aliasing '
              '-Wno-error=deprecated-declarations -Wno-stringop-overflow '
              '-Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls '
              '-Wno-error=old-style-cast -fdiagnostics-color=always '
              '-faligned-new -Wno-unused-but-set-variable '
              '-Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math '
              '-Werror=format -Werror=cast-function-type '
              '-Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, '
              'PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, '
              'TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, '
              'USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, '
              'USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'),
             ('TorchVision', '0.15.0'),
             ('OpenCV', '4.7.0'),
             ('MMEngine', '0.8.1')])

Reproduces the problem - code sample

mmengine._strategy.deepspeed.DeepSpeedStrategy.load_checkpoint

Reproduces the problem - command or script

NA

Reproduces the problem - error message

NA

Additional information

The `strict` parameter is unused, so loading raises an error whenever the `state_dict` does not exactly match the model.

An additional parameter `load_module_only` should also be added; otherwise extra key-value pairs must be added to the checkpoint just to satisfy the loader.
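To illustrate the two flags (a hypothetical, framework-free sketch; `load_ckpt`, the key names, and the expected key set are all made up for this example, not DeepSpeed's real API): a loader that always validates optimizer state forces the checkpoint to carry extra entries, while a `load_module_only` switch skips that requirement:

```python
def load_ckpt(ckpt: dict, strict: bool = False, load_module_only: bool = True) -> dict:
    """Hypothetical loader sketching the two flags discussed above.

    - strict=True: require the module keys to match an expected set exactly.
    - load_module_only=False: additionally require optimizer state in the checkpoint.
    """
    expected = {'conv.weight', 'conv.bias'}  # stand-in for the model's parameter names
    module = ckpt.get('module', {})
    if strict and set(module) != expected:
        raise KeyError(f'mismatched keys: {set(module) ^ expected}')
    if not load_module_only and 'optimizer' not in ckpt:
        raise KeyError('checkpoint lacks optimizer state; pass load_module_only=True')
    return module

# A plain weights-only checkpoint loads fine when load_module_only=True,
# but fails if the loader insists on optimizer state:
weights = {'module': {'conv.weight': [1.0], 'conv.bias': [0.0]}}
module = load_ckpt(weights, strict=True)
```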

Overall, why not load the pretrained checkpoint before `_wrap_model`? For example:

if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()

Another suggestion (works in ZeRO stage 2, but not in stage 3):

    def load_checkpoint(
        self,
        filename: str,
        *,
        map_location: Union[str, Callable] = 'cpu',
        strict: bool = False,
        load_module_only: bool = True,  # add this
        revise_keys: list = [(r'^module.', '')],
        callback: Optional[Callable] = None,
    ) -> dict:
        """Load checkpoint from given ``filename``.

        Warning:
            The `map_location` and `callback` parameters are not supported yet.

        Args:
            filename (str): Accept local filepath, URL, ``torchvision://xxx``,
                ``open-mmlab://xxx``.
        """
        self.logger.info(f'Load checkpoint from {filename}')

        dirname, basename = osp.split(filename)
        _, extra_ckpt = self.model.load_checkpoint(
            dirname, tag=basename, load_optimizer_states=False, 
            load_module_strict=strict, load_module_only=load_module_only)  # add this
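The wiring in the patch above can be checked in isolation with a stand-in engine (`FakeEngine` and `Strategy` below are hypothetical stubs, not the real DeepSpeed or mmengine classes): the strategy's `strict` maps onto `load_module_strict`, and `load_module_only` is forwarded unchanged.

```python
import os.path as osp

class FakeEngine:
    """Stand-in for a DeepSpeed engine; records the kwargs it receives."""
    def load_checkpoint(self, dirname, tag=None, load_optimizer_states=True,
                        load_module_strict=True, load_module_only=False):
        self.last_call = dict(tag=tag,
                              load_optimizer_states=load_optimizer_states,
                              load_module_strict=load_module_strict,
                              load_module_only=load_module_only)
        return dirname, {}  # (load_path, extra client state)

class Strategy:
    """Minimal sketch of the patched load_checkpoint signature."""
    def __init__(self):
        self.model = FakeEngine()

    def load_checkpoint(self, filename, *, strict=False, load_module_only=True):
        dirname, basename = osp.split(filename)
        _, extra_ckpt = self.model.load_checkpoint(
            dirname, tag=basename, load_optimizer_states=False,
            load_module_strict=strict, load_module_only=load_module_only)
        return extra_ckpt

s = Strategy()
s.load_checkpoint('/ckpts/epoch_1', strict=True)
```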

mypydl avatar Jul 09 '23 03:07 mypydl

Hi @mypydl , thanks for your report.

  1. `strict` should be assigned to `load_module_strict`.
  2. We need to consider how to pass the `strict` and `load_module_only` parameters down to `load_checkpoint`, because `FlexibleRunner` does not accept these settings.
  3. If a new `load_module_only` parameter is added to `DeepSpeedStrategy.load_checkpoint`, it may also need to be added to the other strategies to keep the interface unified.
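For point 2, one possible shape (hypothetical names; `FlexibleRunner`'s real signature may differ) is to forward load-time options as an opaque kwargs dict instead of widening every strategy's signature:

```python
class Runner:
    """Sketch: the runner stores opaque load-time kwargs and forwards
    them to whatever strategy callable it holds."""
    def __init__(self, strategy, load_kwargs=None):
        self.strategy = strategy
        self.load_kwargs = load_kwargs or {}

    def load_checkpoint(self, filename):
        # Strategies that do not understand a given key can ignore or reject it,
        # so the runner stays agnostic of per-strategy options.
        return self.strategy(filename, **self.load_kwargs)

calls = []
runner = Runner(lambda f, **kw: calls.append((f, kw)),
                load_kwargs={'strict': True, 'load_module_only': True})
runner.load_checkpoint('epoch_1.pth')
```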

zhouzaida avatar Jul 11 '23 05:07 zhouzaida

If the model is partitioned, loading the checkpoint before `_wrap_model` does not work well.

zhouzaida avatar Jul 11 '23 06:07 zhouzaida