[Bug] DeepSpeedStrategy load_checkpoint `strict`&`load_module_only`
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).
Environment
OrderedDict([('sys.platform', 'linux'),
('Python', '3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0]'),
('CUDA available', True),
('numpy_random_seed', 2147483648),
('GPU 0,1,2,3,4,5,6,7', 'NVIDIA GeForce RTX 2080 Ti'),
('CUDA_HOME', '/usr/local/cuda'),
('NVCC', 'Cuda compilation tools, release 11.8, V11.8.89'),
('GCC', 'gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0'),
('PyTorch', '2.0.0'),
('PyTorch compiling details',
'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product '
'Build 20230303 for Intel(R) 64 architecture applications\n'
' - Intel(R) MKL-DNN v2.7.3 (Git Hash '
'6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 11.8\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
' - CuDNN 8.7\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, '
'CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= '
'-D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated '
'-fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG '
'-DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK '
'-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
'-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra '
'-Werror=return-type -Werror=non-virtual-dtor '
'-Werror=bool-operation -Wnarrowing '
'-Wno-missing-field-initializers -Wno-type-limits '
'-Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs '
'-Wno-unused-parameter -Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-error=deprecated-declarations -Wno-stringop-overflow '
'-Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls '
'-Wno-error=old-style-cast -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math '
'-Werror=format -Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, '
'PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, '
'USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, '
'USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'),
('TorchVision', '0.15.0'),
('OpenCV', '4.7.0'),
('MMEngine', '0.8.1')])
Reproduces the problem - code sample
mmengine._strategy.deepspeed.DeepSpeedStrategy.load_checkpoint
Reproduces the problem - command or script
NA
Reproduces the problem - error message
NA
Additional information
The parameter `strict` is unused: an error is raised whenever the state_dict does not strictly match the model, no matter what value is passed. An additional parameter `load_module_only` should also be added; otherwise the checkpoint is required to carry additional key-value pairs beyond the module weights.
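For reference, DeepSpeed's engine-level `load_checkpoint` already exposes both switches; the sketch below shows how they could be forwarded (the helper itself is hypothetical, and `engine` stands for the object the strategy calls `self.model.load_checkpoint` on):

```python
def load_weights_only(engine, dirname: str, tag: str, strict: bool = False) -> dict:
    """Hypothetical helper: forward the flags to a DeepSpeed engine.

    ``engine`` is assumed to be an initialized DeepSpeed engine (the object
    wrapped by DeepSpeedStrategy and exposed as ``self.model``).
    """
    _, extra_state = engine.load_checkpoint(
        dirname,
        tag=tag,
        load_module_strict=strict,       # what ``strict`` should map to
        load_optimizer_states=False,
        load_lr_scheduler_states=False,
        load_module_only=True,           # restore module weights only
    )
    return extra_state
```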
Overall, why not load the pretrained checkpoint before `_wrap_model`? For example:
```python
if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()
```
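In mmengine terms, a rough sketch of that idea (the helper name, the registry call, and the checkpoint path are assumptions for illustration; the partitioning caveat raised later in the thread still applies):

```python
from typing import Optional

import torch
from mmengine.registry import MODELS


def build_and_preload(model_cfg: dict,
                      pre_load_checkpoint: Optional[str] = None):
    """Build the bare model and, if requested, load plain PyTorch weights
    into it before the strategy wraps (and possibly partitions) it."""
    model = MODELS.build(model_cfg)
    if pre_load_checkpoint:
        state_dict = torch.load(pre_load_checkpoint, map_location='cpu')
        # Non-strict so a weights-only / backbone checkpoint does not error.
        model.load_state_dict(state_dict, strict=False)
    return model
```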
Another suggestion (this works in ZeRO stage-2, but not in stage-3):
```python
def load_checkpoint(
    self,
    filename: str,
    *,
    map_location: Union[str, Callable] = 'cpu',
    strict: bool = False,
    load_module_only: bool = True,  # add this
    revise_keys: list = [(r'^module.', '')],
    callback: Optional[Callable] = None,
) -> dict:
    """Load checkpoint from given ``filename``.

    Warning:
        `map_location` and `callback` parameters are not supported yet.

    Args:
        filename (str): Accept local filepath, URL, ``torchvision://xxx``,
            ``open-mmlab://xxx``.
    """
    self.logger.info(f'Load checkpoint from {filename}')

    dirname, basename = osp.split(filename)
    _, extra_ckpt = self.model.load_checkpoint(
        dirname, tag=basename, load_optimizer_states=False,
        load_module_strict=strict, load_module_only=load_module_only)  # add these

    return extra_ckpt
```
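A hypothetical call site under the modified signature (the `strategy` object and the checkpoint path are placeholders):

```python
# Assumes `strategy` is a built DeepSpeedStrategy whose model is already wrapped.
extra_ckpt = strategy.load_checkpoint(
    'work_dirs/pretrain/iter_10000',  # split into dirname/tag internally
    strict=False,                     # forwarded to load_module_strict
    load_module_only=True,            # skip optimizer/engine state
)
```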
Hi @mypydl, thanks for your report.

- Assign `strict` to `load_module_strict`.
- We need to consider how to pass the `strict` and `load_module_only` parameters to `load_checkpoint`, because `FlexibleRunner` does not accept these settings.
- If a new `load_module_only` parameter is added to the `load_checkpoint` of `DeepSpeedStrategy`, it may also need to be added to the other strategies to keep the interface unified.
If the model is partitioned, loading the checkpoint before `_wrap_model` will not work well.
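To make the partitioning point concrete: under ZeRO stage-3 the parameters can already be sharded across ranks when the module is built inside `deepspeed.zero.Init`, so weights loaded into the bare model before wrapping are not the tensors DeepSpeed manages afterwards. A minimal sketch, with a toy module and a hand-written stage-3 config (both illustrative):

```python
import deepspeed
import torch.nn as nn


class ToyModel(nn.Module):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)


ds_config = {  # minimal illustrative DeepSpeed config with ZeRO stage-3
    'train_micro_batch_size_per_gpu': 1,
    'zero_optimization': {'stage': 3},
}

# Under zero.Init the parameters are partitioned across ranks at construction
# time, so a plain `load_state_dict` on this object would not populate the
# full tensors. Weights have to be restored through the wrapped engine's
# `load_checkpoint` rather than before `_wrap_model`.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = ToyModel()
```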