[Bug] `misaligned address` in `SyncBuffersHook` all_reduce when using bf16 with deepspeed

Open SCZwangxiao opened this issue 1 year ago • 1 comment

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).

Environment

  • Env in logs:
System environment:
    sys.platform: linux
    Python: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0]
    CUDA available: True
    numpy_random_seed: 950529031
    GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.1, V12.1.105
    GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    PyTorch: 2.1.2+cu121
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.5  (built against CUDA 11.7)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.16.2+cu118
    OpenCV: 4.8.1
    MMEngine: 0.10.2

Runtime environment:
    launcher: pytorch
    randomness: {'seed': None}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8
  • Output of python -c "from mmengine.utils.dl_utils import collect_env; print(collect_env())":
OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1,2,3,4,5,6,7', 'NVIDIA H100 80GB HBM3'), ('CUDA_HOME', '/usr/local/cuda'), ('NVCC', 'Cuda compilation tools, release 12.1, V12.1.105'), ('GCC', 'x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0'), ('PyTorch', '2.1.2+cu121'), ('PyTorch compiling details', 'PyTorch built with:\n  - GCC 9.3\n  - C++ Version: 201703\n  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n  - LAPACK is enabled (usually provided by MKL)\n  - NNPACK is enabled\n  - CPU capability usage: AVX512\n  - CUDA Runtime 12.1\n  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n  - CuDNN 8.5  (built against CUDA 11.7)\n    - Built with CuDNN 8.9.2\n  - Magma 2.6.1\n  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.16.2+cu118'), ('OpenCV', '4.8.1'), ('MMEngine', '0.10.2')])

Reproduces the problem - code sample

The bug is strange, and I have not yet found a minimal reproducible example. Some observations:

  1. The bug appears consistently after I moved from A800 machines to H800 machines. The Docker image is unchanged.
  2. It only occurs under bfloat16.
  3. The `misaligned address` error only occurs at the end of an epoch, when `SyncBuffersHook` runs.
  4. The bug disappears when I remove `SyncBuffersHook`.

I was fine-tuning LLaVA. The buffers include RoPE embeddings.
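
Since the failure is tied to the buffers being all-reduced, one way to narrow it down is to list each buffer's dtype, size, and storage alignment on every rank before the hook runs. The following is only a diagnostic sketch, not a confirmed explanation of the bug: `describe_buffers` and `suspicious_buffers` are hypothetical helper names, and the 16-byte `alignment` default is an assumption rather than a documented NCCL requirement. The helpers rely only on `named_buffers()`, `numel()`, `element_size()`, `data_ptr()`, and `dtype`, which `torch.nn.Module` and `torch.Tensor` provide.

```python
from collections import namedtuple

BufferInfo = namedtuple("BufferInfo", "name dtype numel nbytes aligned")


def describe_buffers(model, alignment=16):
    """List every registered buffer with its dtype, size, and address alignment.

    `model` is anything exposing `named_buffers()` (e.g. a torch.nn.Module).
    `alignment` is in bytes; 16 is an assumed value, not a documented one.
    """
    rows = []
    for name, buf in model.named_buffers():
        nbytes = buf.numel() * buf.element_size()
        # A buffer counts as aligned if its start address is a multiple of `alignment`.
        aligned = buf.data_ptr() % alignment == 0
        rows.append(BufferInfo(name, str(buf.dtype), buf.numel(), nbytes, aligned))
    return rows


def suspicious_buffers(model, alignment=16):
    """Return only the buffers whose start address is not `alignment`-aligned."""
    return [row for row in describe_buffers(model, alignment) if not row.aligned]
```

Calling `describe_buffers(runner.model)` just before `after_train_epoch` and comparing the output between the A800 and H800 machines might show whether the bfloat16 buffers differ in dtype mix or alignment. This checks only each buffer's own address, not the flattened tensor that the coalesced all_reduce builds from them.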

Reproduces the problem - command or script

See above

Reproduces the problem - error message

06/17 19:51:14 - mmengine - INFO - Epoch(train) [1][ 8/10]  base_lr: 1.5433e-03 lr: 1.5433e-03  eta: 0:00:03  time: 1.6954  data_time: 0.0657  memory: 20203  image/loss: 8.3296
06/17 19:51:15 - mmengine - INFO - Epoch(train) [1][ 9/10]  base_lr: 7.3223e-04 lr: 7.3223e-04  eta: 0:00:01  time: 1.6479  data_time: 0.0588  memory: 20213  image/loss: 8.2626
06/17 19:51:16 - mmengine - INFO - Exp name: llava_vitl-14_336_7b_pt_CSC_16xb16_zero2_20240617_194817
06/17 19:51:16 - mmengine - INFO - Epoch(train) [1][10/10]  base_lr: 1.9030e-04 lr: 1.9030e-04  eta: 0:00:00  time: 1.6133  data_time: 0.0533  memory: 20310  image/loss: 8.2316

aiplatform-wlf2-ge11-33:19124:19124 [0] enqueue.cc:1087 NCCL WARN Cuda failure 'misaligned address'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 143, in <module>
    main()
  File "tools/train.py", line 139, in main
    runner.train()
  File "/home/wangxiao24/dev_videochat/kvchat/engine/runner/kvchat_runner.py", line 260, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/home/wangxiao24/dev_videochat/kvchat/engine/runner/video_pt_loop.py", line 80, in run_epoch
    self.runner.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/_flexible_runner.py", line 1271, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/hooks/sync_buffer_hook.py", line 42, in after_train_epoch
    all_reduce_params(runner.model.buffers(), op='mean')
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 1160, in all_reduce_params
    _all_reduce_coalesced(params_data, bucket_size_mb, op=op, group=group)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 1108, in _all_reduce_coalesced
    all_reduce(flat_tensors, op=op, group=group)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/dist/dist.py", line 98, in all_reduce
    torch_dist.all_reduce(data_on_device, _get_reduce_op('sum'), group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 52, in wrapper
    "args": f"{args}, {kwargs}",
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 431, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 664, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 595, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 329, in _tensor_str
    self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
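
Reading the two tracebacks together, the first failure is the NCCL all_reduce itself. The later `self.float()` error appears to be secondary: it is raised while the `c10d_logger` wrapper formats the tensor arguments for its own error report. The NCCL message asks for a rerun with `NCCL_DEBUG=INFO`, and `CUDA_LAUNCH_BLOCKING=1` is a standard CUDA setting that makes kernel launches synchronous, so errors surface closer to the failing call. Below is a minimal sketch for building such an environment; `debug_env` is a hypothetical helper name, not part of mmengine or PyTorch.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with verbose NCCL/CUDA debugging enabled."""
    env = dict(os.environ if base is None else base)
    # Requested by the NCCL error message itself.
    env["NCCL_DEBUG"] = "INFO"
    # Synchronous kernel launches: slower, but errors point at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` for whatever launch command is in use. Both variables need to be set before the training processes start.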

Additional information

No response

SCZwangxiao, Jun 17 '24 12:06

Hi Xiao, have you fixed this error? I hit the same error message when using accelerate integrated with deepspeed, and I could not find a working solution. Do you have any update?

fangchuan, Jan 06 '25 08:01