
[BUG/Help] RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Open · zxy333666 opened this issue on Jun 13, 2023 · 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

```
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map:  20%|████████████████████████▉                         | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]

Traceback (most recent call last):
  File "/data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py", line 440, in <module>
    main()
  File "/data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py", line 251, in main
    with training_args.main_process_first(desc="train dataset map pre-processing"):
  File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py", line 1888, in main_process_first
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
    work = default_pg.barrier(opts=opts)

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: <unknown function> + 0x1be3a3 (0x7f79d31fd3a3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f79d31fe721 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
```
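From the traceback, rank 1 is blocked in torch.distributed.barrier() inside main_process_first while rank 0 runs the dataset map alone; once the process group's timeout (30 minutes by default in torch.distributed) elapses, the c10d TCPStore lookup fails with this Socket Timeout. Below is a minimal sketch of one workaround, raising the timeout so the waiting ranks outlast the preprocessing; the 4-hour figure is an assumption sized to the progress bar above, not a recommended value:

```python
# Minimal sketch, not the project's official fix: give the process group a
# longer timeout so non-main ranks survive a multi-hour dataset.map() on rank 0.
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option 1: if your launch path calls init_process_group directly
# (the default timeout is timedelta(minutes=30)).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=4))

# Option 2: let the Trainer wire it up via TrainingArguments; `ddp_timeout`
# (in seconds) is available in recent transformers releases, including 4.28.
args = TrainingArguments(output_dir="output", ddp_timeout=4 * 3600)
```

Another option is to run the preprocessing once in a single-process job so datasets caches the map result, and only then start the multi-GPU run.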

Expected Behavior

No response

Steps To Reproduce

1. P-tuning on a dataset of 1.33 million examples
2. Two A100 GPUs
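For scale, a back-of-the-envelope check (a sketch using the numbers from the log above) shows why this setup trips the 30-minute default barrier timeout:

```python
# Values read off the Map progress bar in the log above.
examples = 1_332_406   # total rows being preprocessed
rate = 142.84          # examples/s reported by datasets.map
hours = examples / rate / 3600
print(f"{hours:.1f} h")  # ~2.6 h of preprocessing on rank 0 vs. a 30-min timeout
```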

Environment

- OS:
- Python: 3.10.8
- Transformers: 4.28.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True


Anything else?

No response

zxy333666 · Jun 13 '23 02:06

Same question here, I'm running into the same problem.

lhy101 · Jun 25 '23 17:06

> Same question here, I'm running into the same problem.

In my case, this happened because I had set max_source_length and max_target_length too large.
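A rough sketch of why oversized limits matter (the numbers below are hypothetical, not taken from this thread): each example is padded toward max_source_length + max_target_length, so both the preprocessing map and the per-step memory grow with that sum.

```python
# Hypothetical settings to illustrate the scaling; the total padded length
# of every training example is the sum of the two limits.
settings = {
    "oversized": (2048, 2048),
    "modest": (64, 128),
}
for name, (max_source_length, max_target_length) in settings.items():
    print(f"{name}: max_seq_length = {max_source_length + max_target_length}")
```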

zzoneee · Jun 26 '23 06:06