ChatGLM-6B
[BUG/Help] RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map: 20%| | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]
Traceback (most recent call last):
/data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:440 in
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9:
Expected Behavior
No response
Steps To Reproduce
1. P-tuning dataset of 1.33 million examples
2. Two A100 GPUs
Environment
- OS:
- Python:3.10.8
- Transformers: 4.28.1
- PyTorch:1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
Anything else?
No response
Same question here, I'm running into the same problem.
Same question here, I'm running into the same problem.
In my case, this problem appeared because I had set max_source_length and max_target_length too large.
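For context on the error itself: rank 1 is waiting for rank 0's ncclUniqueId through the c10d TCPStore, but rank 0 is still tokenizing the 1.33M-example dataset (the `Map: 20%` progress line), so the default 30-minute store timeout expires before rendezvous completes. The sketch below is only an illustration of one common mitigation, raising that timeout; the torchrun launch line, the `--ddp_timeout` flag usage, and the 7200-second value are assumptions for illustration, not taken from this thread.

```python
# Hedged sketch: raise the distributed-rendezvous timeout so the other rank keeps
# waiting while rank 0 finishes the slow dataset Map (tokenization) step.

# Option 1 (HF Trainer, e.g. ptuning/main.py launched with torchrun): pass a larger
# ddp_timeout in seconds on the command line (available in transformers 4.28.1), e.g.
#   torchrun --nproc_per_node=2 main.py ... --ddp_timeout 7200   # 7200 s is illustrative

# Option 2 (manual initialization): set the timeout when creating the process group.
import datetime

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(seconds=7200),  # default is 1800 s (30 minutes)
)
```

The workaround mentioned in the comment above, reducing max_source_length and max_target_length, helps for the same reason: it shortens the preprocessing step so it finishes before the timeout fires.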