
"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:7897: Socket closed"

Open lesong36 opened this issue 1 year ago • 2 comments

(.venv) (base) coty@P16:~/OneDrive/LLM/repo/exo$ DEBUG=9 python3 main.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


[exo ASCII art banner]

Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Trying to find available port port=50355 [60304, 55379, 57624, 60258, 57340, 58850, 53290, 55123, 57105, 59823, 50717]
Using available port: 50355
Retrieved existing node ID: d639030c-62f3-47c5-bc1f-0ee22be53e67
Chat interface started:

  • http://172.17.0.1:8000
  • http://192.168.0.116:8000
  • http://192.168.0.109:8000
  • http://127.0.0.1:8000

ChatGPT API endpoint served at:
  • http://172.17.0.1:8000/v1/chat/completions
  • http://192.168.0.116:8000/v1/chat/completions
  • http://192.168.0.109:8000/v1/chat/completions
  • http://127.0.0.1:8000/v1/chat/completions

tinygrad Device.DEFAULT='NV'
NVIDIA device gpu_name='NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU' gpu_memory_info=<pynvml.nvml.c_nvmlMemory_t object at 0x725cb63123d0>
Server started, listening on 0.0.0.0:50355
tinygrad Device.DEFAULT='NV'
NVIDIA device gpu_name='NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU' gpu_memory_info=<pynvml.nvml.c_nvmlMemory_t object at 0x725c9efc9050>
Collecting topology max_depth=4 visited=set()
Collected topology: Topology(Nodes: {d639030c-62f3-47c5-bc1f-0ee22be53e67: Model: Linux Box (NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU). Chip: NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU. Memory: 16376MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS}, Edges: {})
Collecting topology max_depth=3 visited={'d639030c-62f3-47c5-bc1f-0ee22be53e67'}
CollectTopology max_depth=3 visited={'d639030c-62f3-47c5-bc1f-0ee22be53e67'} nodes={'d639030c-62f3-47c5-bc1f-0ee22be53e67': model: "Linux Box (NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU)" chip: "NVIDIA RTX 5000 ADA GENERATION LAPTOP GPU" memory: 16376 flops { } } peer_graph={}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited=set()
Error collecting topology from 618e0af0-8eef-4153-9032-41d1d821b2c3: <AioRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:7897: Socket closed"
  debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-13T01:36:37.527367082+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:7897: Socket closed"}"

Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited={'618e0af0-8eef-4153-9032-41d1d821b2c3'}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited={'618e0af0-8eef-4153-9032-41d1d821b2c3'}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited={'618e0af0-8eef-4153-9032-41d1d821b2c3'}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited={'618e0af0-8eef-4153-9032-41d1d821b2c3'}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
Connected to peer Model: Linux Box (NVIDIA RTX A6000). Chip: NVIDIA RTX A6000. Memory: 49140MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS (peer.id()='618e0af0-8eef-4153-9032-41d1d821b2c3')
Collecting topology max_depth=4 visited={'618e0af0-8eef-4153-9032-41d1d821b2c3'}
Connecting to 618e0af0-8eef-4153-9032-41d1d821b2c3...
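The repeated "Connecting to 618e0af0..." lines above show the node retrying the stale address in a tight loop. A generic exponential-backoff wrapper (illustrative only; not exo's actual networking code, and `retry_with_backoff` is a hypothetical name) would at least rate-limit those attempts while a peer is unreachable:

```python
import time

def retry_with_backoff(fn, attempts=5, base_delay=0.5, max_delay=8.0):
    """Call fn(); on ConnectionError, sleep with exponential backoff and retry.

    Raises the last ConnectionError if all attempts fail.
    """
    delay = base_delay
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the backoff
```

Backoff alone would not fix the underlying stale-port problem described below, but it would stop the log from being flooded with reconnect attempts.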

lesong36 · Aug 12 '24 17:08

Tl;dr: we need more robust connection management. One annoying issue right now, after introducing sticky node IDs, is that if a node restarts and its ephemeral port changes, other nodes may still try to talk to it on the port previously assigned to that node ID.

The good news is that this is all pretty easy to fix; it just requires a small refactor of the networking code.
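One way to sketch the fix (hypothetical names, not exo's actual API): keep the peer registry keyed by the sticky node ID, but overwrite the cached address whenever a discovery announcement arrives, so a restarted node's new ephemeral port replaces the stale one instead of being ignored.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    node_id: str
    host: str
    port: int

class PeerRegistry:
    """Maps sticky node IDs to their most recently advertised address."""

    def __init__(self):
        self._peers: dict[str, Peer] = {}

    def on_discovery(self, node_id: str, host: str, port: int) -> None:
        # Always take the latest advertisement: a restarted node keeps its
        # node ID but typically binds a new ephemeral port.
        existing = self._peers.get(node_id)
        if existing is None or (existing.host, existing.port) != (host, port):
            self._peers[node_id] = Peer(node_id, host, port)

    def address_of(self, node_id: str) -> tuple[str, int]:
        peer = self._peers[node_id]
        return peer.host, peer.port
```

With this shape, a connection that fails with UNAVAILABLE can simply re-look-up `address_of(node_id)` after the next discovery round rather than redialing the dead port forever.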

AlexCheema · Aug 12 '24 18:08

> Tl;dr: we need more robust connection management. One annoying issue right now, after introducing sticky node IDs, is that if a node restarts and its ephemeral port changes, other nodes may still try to talk to it on the port previously assigned to that node ID.

> The good news is that this is all pretty easy to fix; it just requires a small refactor of the networking code.

Thanks for your answer. Is there anything I can do to help with this issue?

lesong36 · Aug 13 '24 01:08