
[BUG/Help] RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Open · zxy333666 opened this issue on Jun 13, 2023 · 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

```
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map:  20%|████████████████████████▉                         | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]

Traceback (most recent call last):
  File "/data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py", line 440, in <module>
    main()
  File "/data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py", line 251, in main
    with training_args.main_process_first(desc="train dataset map pre-processing"):
  File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py", line 1888, in main_process_first
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
    work = default_pg.barrier(opts=opts)

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: <unknown function> + 0x1be3a3 (0x7f79d31fd3a3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f79d31fe721 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
```
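From the traceback, rank 1 is blocked in torch.distributed.barrier() inside main_process_first while rank 0 runs the dataset map alone; once the process group's timeout (30 minutes by default in torch.distributed) elapses, the c10d TCPStore lookup fails with this Socket Timeout. Below is a minimal sketch of one workaround, raising the timeout so the waiting ranks outlast the preprocessing; the 4-hour figure is an assumption sized to the progress bar above, not a recommended value:

```python
# Minimal sketch, not the project's official fix: give the process group a
# longer timeout so non-main ranks survive a multi-hour dataset.map() on rank 0.
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option 1: if your launch path calls init_process_group directly
# (the default timeout is timedelta(minutes=30)).
dist.init_process_group(backend="nccl", timeout=timedelta(hours=4))

# Option 2: let the Trainer wire it up via TrainingArguments; `ddp_timeout`
# (in seconds) is available in recent transformers releases, including 4.28.
args = TrainingArguments(output_dir="output", ddp_timeout=4 * 3600)
```

Another option is to run the preprocessing once in a single-process job so datasets caches the map result, and only then start the multi-GPU run.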

Expected Behavior

No response

Steps To Reproduce

1. P-tuning on a dataset of 1.33 million examples
2. Two A100 GPUs
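For scale, a back-of-the-envelope check (a sketch using the numbers from the log above) shows why this setup trips the 30-minute default barrier timeout:

```python
# Values read off the Map progress bar in the log above.
examples = 1_332_406   # total rows being preprocessed
rate = 142.84          # examples/s reported by datasets.map
hours = examples / rate / 3600
print(f"{hours:.1f} h")  # ~2.6 h of preprocessing on rank 0 vs. a 30-min timeout
```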

Environment

- OS:
- Python: 3.10.8
- Transformers: 4.28.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True


Anything else?

No response

zxy333666 · Jun 13 '23 02:06

Same question here, I'm running into the same problem.

lhy101 · Jun 25 '23 17:06

> Same question here, I'm running into the same problem.

In my case, this happened because I had set max_source_length and max_target_length too large.
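A rough sketch of why oversized limits matter (the numbers below are hypothetical, not taken from this thread): each example is padded toward max_source_length + max_target_length, so both the preprocessing map and the per-step memory grow with that sum.

```python
# Hypothetical settings to illustrate the scaling; the total padded length
# of every training example is the sum of the two limits.
settings = {
    "oversized": (2048, 2048),
    "modest": (64, 128),
}
for name, (max_source_length, max_target_length) in settings.items():
    print(f"{name}: max_seq_length = {max_source_length + max_target_length}")
```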

zzoneee · Jun 26 '23 06:06