GPLinker_pytorch

Hi, I get no errors when running on a single GPU, but running on two GPUs with Accelerate fails with the error below:

Open · fmdmm opened this issue 2 years ago · 0 comments

The trace points at "batch_outputs[0].cpu().numpy()", but I can't figure out why a single GPU has no problem with this line. [screenshot]
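
For reference, the failing call lives in utils/postprocess.py and is reached from evaluate() in train.py (see the traceback below). The sketch here is only an illustration of that flow, not the repository's actual code: the gather step, the loop structure, and the function signatures are assumptions. Its point is that the .cpu() copy is the first operation that waits on the GPU, so an asynchronous kernel failure gets reported there rather than where it actually happened.

```python
import torch


def postprocess_gplinker(batch_outputs):
    """Placeholder for utils/postprocess.py; this is where the error surfaces."""
    # .cpu() blocks until all queued kernels touching these tensors have finished,
    # so a kernel that already "timed out and was terminated" is reported on this
    # line even though the failure occurred earlier on the device.
    return batch_outputs[0].cpu().numpy()


@torch.no_grad()
def evaluate(model, dataloader, accelerator):
    """Illustrative multi-GPU eval loop under Accelerate (not the repo's exact code)."""
    model.eval()
    all_outputs = []
    for batch in dataloader:
        batch_outputs = model(**batch)                     # kernels launch asynchronously
        batch_outputs = accelerator.gather(batch_outputs)  # hypothetical gather across ranks
        outputs_gathered = postprocess_gplinker(batch_outputs)
        all_outputs.append(outputs_gathered)
    return all_outputs
```

Seen this way, the line is simply the first synchronization point after the real failure, which is why it can look innocent in one setup and guilty in another.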

The full error output is as follows:

Training: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 640/640 [08:48<00:00, 1.22it/s]
##--------------------- Dev --------------------------------------------------------------------------------
f1 = 0.20078740157511782 precision = 0.22666666666701038 recall = 0.1802120141345653

**--------------------- Dev End
Traceback (most recent call last):
  File "train.py", line 365, in <module>
    main()
  File "train.py", line 322, in main
    dev_metric = evaluate(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 48, in evaluate
    outputs_gathered = postprocess_gplinker(
  File "/sharedFolder/GPLinker_pytorch-dev/utils/postprocess.py", line 8, in postprocess_gplinker
    batch_outputs[0].cpu().numpy(),
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f736ca5fd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c4d3 (0x7f736ccc24d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f736ccc2ee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f736ca49314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29e239 (0x7f73c9422239 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadf291 (0x7f73c9c63291 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f73c9c63592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3() [0x5aee8a]
frame #8: /usr/bin/python3() [0x5ed1a0]
frame #9: /usr/bin/python3() [0x544188]
frame #10: /usr/bin/python3() [0x5441da]
frame #11: /usr/bin/python3() [0x5441da]
frame #12: /usr/bin/python3() [0x5441da]
frame #13: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #14: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #15: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #16: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #17: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #18: __libc_start_main + 0xf3 (0x7f73db4910b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb12e in /usr/bin/python3)

Traceback (most recent call last):
  File "train.py", line 365, in <module>
    main()
  File "train.py", line 352, in main
    accelerator.wait_for_everyone()
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 496, in wait_for_everyone
    wait_for_everyone()
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils.py", line 530, in wait_for_everyone
    torch.distributed.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2716, in barrier
    work.wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd613fdbd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c4d3 (0x7fd61423e4d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fd61423eee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fd613fc5314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x4a (0x7fd670de549a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x63 (0x7fd617714f33 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7fd6177150c9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe6c6d6 (0x7fd67156c6d6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xe6c72a (0x7fd67156c72a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2a6c10 (0x7fd6709a6c10 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x2a7e7e (0x7fd6709a7e7e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5ed1a0]
frame #12: /usr/bin/python3() [0x544188]
frame #13: /usr/bin/python3() [0x5441da]
frame #14: /usr/bin/python3() [0x5441da]
frame #15: /usr/bin/python3() [0x5441da]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #18: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #19: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #20: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #21: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fd682a0c0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x5fb12e in /usr/bin/python3)
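
Both traces repeat the same hint: the kernel error is reported asynchronously, so the Python lines above may not be where the kernel actually died. One way to localize it (a debugging sketch, not something from the repo) is to force synchronous kernel launches and rerun the two-GPU job:

```python
import os

# The CUDA runtime reads this at initialization, so it must be set before any
# CUDA work happens in the process. With `accelerate launch`, exporting
# CUDA_LAUNCH_BLOCKING=1 in the shell before the command is the safer route,
# since each worker is a separate process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the variable on purpose)
```

With blocking launches, the RuntimeError should point at the kernel that actually timed out instead of at the later .cpu() or barrier() call.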

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 31892) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2022-07-19_03:27:00
  host      : 7fc5b780751c
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 31893)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31893

Root Cause (first observed failure):
[0]:
  time      : 2022-07-19_03:27:00
  host      : 7fc5b780751c
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 31892)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 31892

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 378, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '2', 'train.py', '--model_type', 'bert', '--pretrained_model_name_or_path', 'bert-base-chinese', '--method', 'gplinker', '--logging_steps', '200', '--num_train_epochs', '20', '--learning_rate', '3e-5', '--num_warmup_steps_or_radios', '0.1', '--gradient_accumulation_steps', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '32', '--seed', '42', '--save_steps', '10804', '--output_dir', './outputs', '--max_length', '128', '--topk', '1', '--num_workers', '8', '--model_cache_dir', '/mnt/f/hf/models']' returned non-zero exit status 1.

fmdmm · Jul 19 '22 07:07