
[Bug] Error when using multiple GPUs

Open HelanHu opened this issue 1 year ago • 0 comments

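Prerequisite

  • [X] I have searched Issues and Discussions but did not get the expected help.
  • [X] The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.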

Environment

python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A40',
 'MMEngine': '0.10.2',
 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
 'OpenCV': '4.9.0',
 'PyTorch': '2.1.1',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              ' - GCC 9.3\n'
                              ' - C++ Version: 201703\n'
                              ' - Intel(R) oneAPI Math Kernel Library Version '
                              '2023.1-Product Build 20230303 for Intel(R) 64 '
                              'architecture applications\n'
                              ' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
                              '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
                              ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              ' - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              ' - NNPACK is enabled\n'
                              ' - CPU capability usage: AVX512\n'
                              ' - CUDA Runtime 11.8\n'
                              ' - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n'
                              ' - CuDNN 8.7\n'
                              ' - Magma 2.6.1\n'
                              ' - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.8, '
                              'CUDNN_VERSION=8.7.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=old-style-cast '
                              '-Wno-invalid-partial-specialization '
                              '-Wno-unused-private-field '
                              '-Wno-aligned-allocation-unavailable '
                              '-Wno-missing-braces -fdiagnostics-color=always '
                              '-faligned-new -Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.1.1, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]',
 'TorchVision': '0.16.1',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.1+4f78388',
 'sys.platform': 'linux'}

Reproduces the problem - code sample

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.ARC_c.ARC_c_gen import ARC_c_datasets

datasets = [*ARC_c_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-70b-hf',
        path="xxx/models/Llama-2-70b-hf",
        tokenizer_path='xxx/models/Llama-2-70b-hf',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=1,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False,  # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=4, num_procs=1),
    )
]

Reproduces the problem - command or script

python run.py /home/hhl/evaluation/opencompass/configs/eval_arc.py --debug

Reproduces the problem - error message

01/30 00:18:27 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
01/30 00:18:27 - OpenCompass - DEBUG - Get class SizePartitioner from "partitioner" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An SizePartitioner instance is built from registry, and its implementation can be found in opencompass.partitioners.size
01/30 00:18:27 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Key eval.runner.task.dump_details not found in config, ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Additional config: {}
01/30 00:18:27 - OpenCompass - INFO - Partitioned into 1 tasks.
01/30 00:18:27 - OpenCompass - DEBUG - Task 0: [llama-2-70b-hf/ARC-c]
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
01/30 00:18:27 - OpenCompass - DEBUG - Get class LocalRunner from "runner" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An LocalRunner instance is built from registry, and its implementation can be found in opencompass.runners.local
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
01/30 00:18:27 - OpenCompass - DEBUG - Get class OpenICLInferTask from "task" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An OpenICLInferTask instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_infer
01/30 00:18:31 - OpenCompass - INFO - Task [llama-2-70b-hf/ARC-c]
01/30 00:18:32 - OpenCompass - WARNING - pad_token_id is not set for the tokenizer.
01/30 00:18:32 - OpenCompass - WARNING - Using eos_token_id as pad_token_id.

Loading checkpoint shards: 0%| | 0/15 [00:00<?, ?it/s]
Loading checkpoint shards: 7%|▋ | 1/15 [00:01<00:22, 1.61s/it]
Loading checkpoint shards: 13%|█▎ | 2/15 [00:03<00:22, 1.76s/it]
Loading checkpoint shards: 20%|██ | 3/15 [00:05<00:20, 1.72s/it]
Loading checkpoint shards: 27%|██▋ | 4/15 [00:06<00:19, 1.76s/it]
Loading checkpoint shards: 33%|███▎ | 5/15 [00:08<00:17, 1.70s/it]
Loading checkpoint shards: 40%|████ | 6/15 [00:10<00:15, 1.71s/it]
Loading checkpoint shards: 47%|████▋ | 7/15 [00:12<00:13, 1.73s/it]
Loading checkpoint shards: 53%|█████▎ | 8/15 [00:13<00:12, 1.74s/it]
Loading checkpoint shards: 60%|██████ | 9/15 [00:15<00:10, 1.70s/it]
Loading checkpoint shards: 67%|██████▋ | 10/15 [00:17<00:08, 1.66s/it]
Loading checkpoint shards: 73%|███████▎ | 11/15 [00:18<00:06, 1.70s/it]
Loading checkpoint shards: 80%|████████ | 12/15 [00:20<00:05, 1.68s/it]
Loading checkpoint shards: 87%|████████▋ | 13/15 [00:22<00:03, 1.65s/it]
Loading checkpoint shards: 93%|█████████▎| 14/15 [00:23<00:01, 1.63s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:23<00:00, 1.20s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:23<00:00, 1.59s/it]
01/30 00:18:58 - OpenCompass - INFO - Start inferencing [llama-2-70b-hf/ARC-c]
[2024-01-30 00:18:59,241] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...

0%| | 0/1165 [00:00<?, ?it/s]
/xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
/xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
  warnings.warn(
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [17,0,0], thread: [32,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [17,0,0], thread: [33,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
....
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [63,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

0%| | 0/1165 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "xxx/opencompass/opencompass/tasks/openicl_infer.py", line 153, in <module>
    inferencer.run()
  File "xxx/opencompass/opencompass/tasks/openicl_infer.py", line 81, in run
    self._inference()
  File "xxx/opencompass/opencompass/tasks/openicl_infer.py", line 126, in _inference
    inferencer.inference(retriever,
  File "/xxx/opencompass/opencompass/openicl/icl_inferencer/icl_gen_inferencer.py", line 146, in inference
    results = self.model.generate_from_template(
  File "xxx/opencompass/opencompass/models/base.py", line 165, in generate_from_template
    return self.generate(inputs, max_out_len=max_out_len, **kwargs)
  File "xxx/opencompass/opencompass/models/huggingface.py", line 250, in generate
    return sum(
  File "xxx/opencompass/opencompass/models/huggingface.py", line 251, in <genexpr>
    (self._single_generate(inputs=[input],
  File "xxx/opencompass/opencompass/models/huggingface.py", line 407, in _single_generate
    outputs = self.model.generate(input_ids=input_ids,
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py", line 1596, in generate
    return self.greedy_search(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py", line 2444, in greedy_search
    outputs = self(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 322, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 184, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
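
The assertion printed by IndexKernel.cu, -sizes[i] <= index && index < sizes[i], is the bounds check behind the failing cos[position_ids] lookup. A minimal pure-Python sketch of the same condition follows. The helper out_of_bounds, the name cache_len, and the sample ids are hypothetical and chosen only for illustration; they are not taken from transformers, and the sketch does not explain why the ids would be invalid only with several GPUs.

```python
def out_of_bounds(indices, size):
    """Return the indices that violate -size <= index < size."""
    return [i for i in indices if not (-size <= i < size)]


# Hypothetical numbers for illustration only.
cache_len = 2048                      # assumed number of rows in the cos/sin cache
position_ids = [0, 1, 2, 2047, 2048]  # the last id is one past the end

bad = out_of_bounds(position_ids, cache_len)
print(bad)  # any value listed here would trip the device-side assert
```

Logging the minimum and maximum of position_ids just before the lookup would show which side of this check fails.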

[2024-01-30 00:19:04,249] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2906944) of binary: xxx/anaconda3/envs/opencompass/bin/python
Traceback (most recent call last):
  File "xxx/anaconda3/envs/opencompass/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.1', 'console_scripts', 'torchrun')())
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

xxx/evaluation/opencompass/opencompass/tasks/openicl_infer.py FAILED
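
Because CUDA kernels run asynchronously, the Python frame shown for a device-side assert is not always the call that failed. A possible next debugging step, sketched below, is to rerun the same command with the CUDA_LAUNCH_BLOCKING=1 environment variable so that kernels run synchronously. The helper blocking_env is a hypothetical name; the snippet only builds the environment and prints the shell line, it does not launch anything.

```python
import os
import shlex


def blocking_env(base_env):
    """Copy an environment mapping and turn on synchronous CUDA kernel launches."""
    env = dict(base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # CUDA runtime variable; errors surface at the failing call
    return env


# Same command as in the reproduction section.
cmd = ["python", "run.py", "/home/hhl/evaluation/opencompass/configs/eval_arc.py", "--debug"]
env = blocking_env(os.environ)

# Print the equivalent shell line; subprocess.run(cmd, env=env) would launch it.
print("CUDA_LAUNCH_BLOCKING=1 " + shlex.join(cmd))
```

Synchronous launches slow the run down, so this is meant for a single debugging pass rather than a full evaluation.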

Other information

With num_gpus=1 there is no error, but with num_gpus>1 I get the error above. I suspected the transformers library, but switching between several versions did not fix it, so I think this is a compatibility issue between your code and transformers.

I want to run Llama 2 70B. To avoid this problem I also tried the non-HuggingFace meta/llama model, but I only have 4 GPUs with 44G each, and it seems that the meta/llama 70B checkpoint can only be loaded on 8 cards.

I hope this bug can be fixed soon.
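
For context on the 4 x 44G setup, here is a rough estimate of the weight memory alone. It is a sketch under stated assumptions: a nominal 70e9 parameters read off the model name, 2 bytes per parameter for float16 and 4 for float32, and no allowance for activations or the KV cache. The helper weights_gb is a hypothetical name.

```python
def weights_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 70e9                 # nominal count, assumed from the "70b" in the model name
fp16 = weights_gb(params, 2)  # float16: 2 bytes per parameter
fp32 = weights_gb(params, 4)  # float32: 4 bytes per parameter
total = 4 * 44                # four cards at 44 GB each, as stated above

print(f"fp16 weights: {fp16:.0f} GB, fp32 weights: {fp32:.0f} GB, available: {total} GB")
```

Under these assumptions the float16 weights fit on four cards with some headroom, while float32 weights would not, so the dtype the model is actually loaded in is worth checking.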

HelanHu, Jan 29 '24 16:01