opencompass
[Bug] Error when using multiple GPUs
Prerequisites
Issue type
I am evaluating with an officially supported task/model/dataset.
Environment
python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))" {'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda', 'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0', 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A40', 'MMEngine': '0.10.2', 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89', 'OpenCV': '4.9.0', 'PyTorch': '2.1.1', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2023.1-Product Build 20230303 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.1.1 (Git Hash ' '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX512\n' ' - CUDA Runtime 11.8\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n' ' - CuDNN 8.7\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, CUDA_VERSION=11.8, ' 'CUDNN_VERSION=8.7.0, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM ' '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK ' '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE ' '-O2 -fPIC -Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' 
'-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wno-psabi ' '-Wno-error=pedantic -Wno-error=old-style-cast ' '-Wno-invalid-partial-specialization ' '-Wno-unused-private-field ' '-Wno-aligned-allocation-unavailable ' '-Wno-missing-braces -fdiagnostics-color=always ' '-faligned-new -Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Werror=cast-function-type ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, ' 'TORCH_DISABLE_GPU_ASSERTS=ON, ' 'TORCH_VERSION=2.1.1, USE_CUDA=ON, USE_CUDNN=ON, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, \n', 'Python': '3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]', 'TorchVision': '0.16.1', 'numpy_random_seed': 2147483648, 'opencompass': '0.2.1+4f78388', 'sys.platform': 'linux'}
Reproduces the problem - code/configuration sample
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    from .datasets.ARC_c.ARC_c_gen import ARC_c_datasets

datasets = [*ARC_c_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-70b-hf',
        path="xxx/models/Llama-2-70b-hf",
        tokenizer_path='xxx/models/Llama-2-70b-hf',
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            use_fast=False,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=1,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False,  # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=4, num_procs=1),
    )
]
Reproduces the problem - command or script
python run.py /home/hhl/evaluation/opencompass/configs/eval_arc.py --debug
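If a run like this fails inside a CUDA kernel, the Python traceback can point at an unrelated line, because kernels are launched asynchronously. A minimal sketch of a wrapper that reruns the same command with `CUDA_LAUNCH_BLOCKING=1` (a standard CUDA environment variable) is shown here. The helper names `build_debug_env` and `build_command` are hypothetical, and the sketch assumes `run.py` is in the current working directory and that the child processes inherit the environment.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Makes kernel launches synchronous, so errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_command(config_path):
    """Build the same command line as in the report (hypothetical helper)."""
    return [sys.executable, "run.py", config_path, "--debug"]


if __name__ == "__main__":
    cmd = build_command("/home/hhl/evaluation/opencompass/configs/eval_arc.py")
    print("Would run:", " ".join(cmd))
    # Uncomment to actually launch the evaluation:
    # subprocess.run(cmd, env=build_debug_env(), check=True)
```

This only changes where the error is reported, not whether it occurs, and it slows the run down, so it is meant for debugging only.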
Reproduces the problem - error message
01/30 00:18:27 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
01/30 00:18:27 - OpenCompass - DEBUG - Get class SizePartitioner from "partitioner" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An SizePartitioner instance is built from registry, and its implementation can be found in opencompass.partitioners.size
01/30 00:18:27 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Key eval.runner.task.dump_details not found in config, ignored.
01/30 00:18:27 - OpenCompass - DEBUG - Additional config: {}
01/30 00:18:27 - OpenCompass - INFO - Partitioned into 1 tasks.
01/30 00:18:27 - OpenCompass - DEBUG - Task 0: [llama-2-70b-hf/ARC-c]
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
01/30 00:18:27 - OpenCompass - DEBUG - Get class LocalRunner from "runner" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An LocalRunner instance is built from registry, and its implementation can be found in opencompass.runners.local
01/30 00:18:27 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
01/30 00:18:27 - OpenCompass - DEBUG - Get class OpenICLInferTask from "task" registry in "opencompass"
01/30 00:18:27 - OpenCompass - DEBUG - An OpenICLInferTask instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_infer
01/30 00:18:31 - OpenCompass - INFO - Task [llama-2-70b-hf/ARC-c]
01/30 00:18:32 - OpenCompass - WARNING - pad_token_id is not set for the tokenizer.
01/30 00:18:32 - OpenCompass - WARNING - Using eos_token_id as pad_token_id.
Loading checkpoint shards: 0%| | 0/15 [00:00<?, ?it/s]
Loading checkpoint shards: 7%|▋ | 1/15 [00:01<00:22, 1.61s/it]
Loading checkpoint shards: 13%|█▎ | 2/15 [00:03<00:22, 1.76s/it]
Loading checkpoint shards: 20%|██ | 3/15 [00:05<00:20, 1.72s/it]
Loading checkpoint shards: 27%|██▋ | 4/15 [00:06<00:19, 1.76s/it]
Loading checkpoint shards: 33%|███▎ | 5/15 [00:08<00:17, 1.70s/it]
Loading checkpoint shards: 40%|████ | 6/15 [00:10<00:15, 1.71s/it]
Loading checkpoint shards: 47%|████▋ | 7/15 [00:12<00:13, 1.73s/it]
Loading checkpoint shards: 53%|█████▎ | 8/15 [00:13<00:12, 1.74s/it]
Loading checkpoint shards: 60%|██████ | 9/15 [00:15<00:10, 1.70s/it]
Loading checkpoint shards: 67%|██████▋ | 10/15 [00:17<00:08, 1.66s/it]
Loading checkpoint shards: 73%|███████▎ | 11/15 [00:18<00:06, 1.70s/it]
Loading checkpoint shards: 80%|████████ | 12/15 [00:20<00:05, 1.68s/it]
Loading checkpoint shards: 87%|████████▋ | 13/15 [00:22<00:03, 1.65s/it]
Loading checkpoint shards: 93%|█████████▎| 14/15 [00:23<00:01, 1.63s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:23<00:00, 1.20s/it]
Loading checkpoint shards: 100%|██████████| 15/15 [00:23<00:00, 1.59s/it]
01/30 00:18:58 - OpenCompass - INFO - Start inferencing [llama-2-70b-hf/ARC-c]
[2024-01-30 00:18:59,241] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
0%| | 0/1165 [00:00<?, ?it/s]/xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: do_sample is set to False. However, temperature is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
warnings.warn(
/xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: do_sample is set to False. However, top_p is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
warnings.warn(
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [17,0,0], thread: [32,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [17,0,0], thread: [33,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
....
/opt/conda/conda-bld/pytorch_1699449181202/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [63,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
0%| | 0/1165 [00:00<?, ?it/s]
Traceback (most recent call last):
File "xxx/opencompass/opencompass/tasks/openicl_infer.py", line 153, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
[2024-01-30 00:19:04,249] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2906944) of binary: xxx/anaconda3/envs/opencompass/bin/python
Traceback (most recent call last):
File "xxx/anaconda3/envs/opencompass/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.1', 'console_scripts', 'torchrun')())
File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "xxx/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
xxx/evaluation/opencompass/opencompass/tasks/openicl_infer.py FAILED
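The assertion repeated in the log above, `-sizes[i] <= index && index < sizes[i]`, is a bounds check on an index lookup. One possible (unconfirmed here) way to trip it is an id that falls outside the table being indexed, for example a token id that is not smaller than the embedding table size. The pure-Python sketch below mirrors that check so suspicious ids can be spotted on the CPU before they reach the GPU. The helper names and the sample ids are hypothetical, and 32000 is used as the Llama 2 vocabulary size.

```python
def index_in_bounds(index: int, size: int) -> bool:
    """Mirror the check from the log: -size <= index < size."""
    return -size <= index < size


def find_out_of_range_ids(token_ids, table_size):
    """Return (position, id) pairs that would fail the bounds check."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if not index_in_bounds(tid, table_size)
    ]


if __name__ == "__main__":
    vocab_size = 32000  # Llama 2 vocabulary size, used here for illustration
    sample_ids = [1, 15043, 31999, 32000]  # made-up ids, last one is too large
    print(find_out_of_range_ids(sample_ids, vocab_size))  # prints [(3, 32000)]
```

In a real run, the ids would come from the tokenizer output and the table size from the model's input embedding. If every id passes, the failing index is coming from somewhere else.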
Other information
With num_gpus=1 there is no error; the error only appears when num_gpus > 1. I suspected a problem in the transformers library, but switching between several versions did not fix it.
I want to run Llama 2 70B, and I only have 4 GPUs with 44G of memory each. I also tried to avoid the problem by using the non-HuggingFace meta/llama model, but loading meta/llama 70B seems to work only with 8 cards.
Could this be a compatibility issue between your code and the transformers library?
I hope this bug can be fixed soon.
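For context on the 4-GPU constraint, here is a rough back-of-the-envelope estimate of the weight memory. It assumes a nominal 70e9 parameters, 2 bytes per parameter (fp16), and an even split across cards. It ignores activations, the KV cache, and CUDA context overhead, so real usage is higher. The function name is hypothetical.

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    num_params = 70e9      # assumption: nominal parameter count of a "70B" model
    bytes_per_param = 2    # assumption: weights loaded in fp16
    num_gpus = 4
    per_gpu_capacity = 44  # as reported above

    total = weights_gib(num_params, bytes_per_param)
    per_gpu = total / num_gpus
    print(f"weights total: {total:.1f} GiB")
    print(f"weights per GPU (even split): {per_gpu:.1f} GiB of {per_gpu_capacity}")
```

Under these assumptions the weights alone should fit on 4 cards with some headroom, which fits a sharding problem better than a plain out-of-memory condition.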