(Bug?) `ValueError: size mismatched, can't reshape self.shape=(1, 25, 128256, 4096) -> new_shape=(1, 25, 32, 128)`
I have a cluster with 3 machines:
- Ubuntu Linux 22.04 with 32 GB RAM + Quadro RTX 4000
- Ubuntu Linux 22.04 with 64 GB RAM + Quadro RTX 5000
- M2 Mac running macOS 14.7 with 32 GB RAM
I still can't get the Linux nodes to report their TFLOPS (they still show 0), but that doesn't seem related to this issue (though maybe I'm wrong?). AFAIK, CUDA is installed and working via the nvidia-cuda-toolkit apt package, and I'm using NVIDIA's v560 (open) driver.
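To double-check that tinygrad itself picks up the GPU (and not just that the driver is installed), something like this can be run inside the exo venv on the Linux nodes (a minimal sanity check, assuming tinygrad's top-level `Tensor` and `Device` exports):

```python
# Minimal check that tinygrad selects the CUDA backend when CUDA=1 is set.
from tinygrad import Device, Tensor

print(Device.DEFAULT)  # expect "CUDA" here (newer tinygrad versions may report "NV")
print((Tensor.ones(4, 4) @ Tensor.ones(4, 4)).numpy())  # a 4x4 matrix of 4.0 if the backend works
```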
I'm trying to run Llama 3.1 8B. As soon as I open tinychat on any of the nodes and start typing with Llama 3.1 8B selected, the RTX 5000 node fails after a while with:
(...)
ram used: 9.16 GB, layers.29.attention_norm.weight : 92%|█████████████████████████████████████████████████████████████████████████████████████▋ | 269/292 [00:05<00:00, 49.12it/s]
ram used: 9.16 GB, layers.29.ffn_norm.weight : 92%|█████████████████████████████████████████████████████████████████████████████████████▉ | 270/292 [00:05<00:00, 49.26it/s]
ram used: 9.16 GB, layers.30.attention.wq.weight : 93%|██████████████████████████████████████████████████████████████████████████████████████▎ | 271/292 [00:05<00:00, 49.40it/s]
ram used: 9.16 GB, layers.30.attention.wk.weight : 93%|██████████████████████████████████████████████████████████████████████████████████████▋ | 272/292 [00:05<00:00, 49.54it/s]
ram used: 9.16 GB, layers.30.attention.wv.weight : 93%|██████████████████████████████████████████████████████████████████████████████████████▉ | 273/292 [00:05<00:00, 49.69it/s]
ram used: 9.16 GB, layers.30.attention.wo.weight : 94%|███████████████████████████████████████████████████████████████████████████████████████▎ | 274/292 [00:05<00:00, 49.83it/s]
ram used: 9.16 GB, layers.30.feed_forward.w1.weight : 94%|███████████████████████████████████████████████████████████████████████████████████████▌ | 275/292 [00:05<00:00, 49.93it/s]
ram used: 9.16 GB, layers.30.feed_forward.w2.weight : 95%|███████████████████████████████████████████████████████████████████████████████████████▉ | 276/292 [00:05<00:00, 50.06it/s]
ram used: 9.16 GB, layers.30.feed_forward.w3.weight : 95%|████████████████████████████████████████████████████████████████████████████████████████▏ | 277/292 [00:05<00:00, 50.20it/s]
ram used: 9.16 GB, layers.30.attention_norm.weight : 95%|████████████████████████████████████████████████████████████████████████████████████████▌ | 278/292 [00:05<00:00, 50.34it/s]
ram used: 9.16 GB, layers.30.ffn_norm.weight : 96%|████████████████████████████████████████████████████████████████████████████████████████▊ | 279/292 [00:05<00:00, 50.48it/s]
ram used: 9.16 GB, layers.31.attention.wq.weight : 96%|█████████████████████████████████████████████████████████████████████████████████████████▏ | 280/292 [00:05<00:00, 50.62it/s]
ram used: 9.16 GB, layers.31.attention.wk.weight : 96%|█████████████████████████████████████████████████████████████████████████████████████████▍ | 281/292 [00:05<00:00, 50.76it/s]
ram used: 9.16 GB, layers.31.attention.wv.weight : 97%|█████████████████████████████████████████████████████████████████████████████████████████▊ | 282/292 [00:05<00:00, 50.85it/s]
ram used: 9.16 GB, layers.31.attention.wo.weight : 97%|██████████████████████████████████████████████████████████████████████████████████████████▏ | 283/292 [00:05<00:00, 50.98it/s]
ram used: 9.16 GB, layers.31.feed_forward.w1.weight : 97%|██████████████████████████████████████████████████████████████████████████████████████████▍ | 284/292 [00:05<00:00, 51.11it/s]
ram used: 9.16 GB, layers.31.feed_forward.w2.weight : 98%|██████████████████████████████████████████████████████████████████████████████████████████▊ | 285/292 [00:05<00:00, 51.24it/s]
ram used: 9.16 GB, layers.31.feed_forward.w3.weight : 98%|███████████████████████████████████████████████████████████████████████████████████████████ | 286/292 [00:05<00:00, 51.36it/s]
ram used: 9.16 GB, layers.31.attention_norm.weight : 98%|███████████████████████████████████████████████████████████████████████████████████████████▍ | 287/292 [00:05<00:00, 51.48it/s]
ram used: 9.16 GB, layers.31.ffn_norm.weight : 99%|███████████████████████████████████████████████████████████████████████████████████████████▋ | 288/292 [00:05<00:00, 51.59it/s]
ram used: 9.16 GB, norm.weight : 99%|████████████████████████████████████████████████████████████████████████████████████████████ | 289/292 [00:05<00:00, 51.70it/s]
update_peers: added=[] removed=[] updated=[] unchanged=[<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x7606b3947920>, <exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at
0x7606b1b965a0>] to_disconnect=[] to_connect=[]
did_peers_change=False
Received request: GET /v1/download/progress
ram used: 9.16 GB, tok_embeddings.weight : 99%|████████████████████████████████████████████████████████████████████████████████████████████▎| 290/292 [00:06<00:00, 47.50it/s]
ram used: 10.21 GB, output.weight : 100%|████████████████████████████████████████████████████████████████████████████████████████████▋| 291/292 [00:06<00:00, 47.62it/s]
ram used: 10.21 GB, freqs_cis : 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 292/292 [00:06<00:00, 47.75it/s]
ram used: 10.21 GB, freqs_cis : 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 292/292 [00:06<00:00, 47.72it/s]
loaded weights in 6123.58 ms, 10.21 GB loaded at 1.67 GB/s
Checking if local path exists to load tokenizer from local local_path=None
Trying AutoProcessor for /home/fullofcaffeine/.cache/huggingface/hub/models--mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated/snapshots/368c8ed94ce4c986e7b9ca5c159651ef753908ce
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=20, n_layers=32): size mismatched, can't reshape self.shape=(1, 25, 128256,
4096) -> new_shape=(1, 25, 32, 128)
Traceback (most recent call last):
File "/home/fullofcaffeine/workspace/code/exo/exo/orchestration/standard_node.py", line 211, in _process_tensor
result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/inference.py", line 80, in infer_tensor
h = await asyncio.get_event_loop().run_in_executor(self.executor, lambda: self.model(Tensor(input_data), start_pos, TEMPERATURE).realize())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/inference.py", line 80, in <lambda>
h = await asyncio.get_event_loop().run_in_executor(self.executor, lambda: self.model(Tensor(input_data), start_pos, TEMPERATURE).realize())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/models/llama.py", line 214, in __call__
return self.forward(tokens, start_pos, temperature, top_k, top_p, alpha_f, alpha_p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/models/llama.py", line 202, in forward
h = layer(h, start_pos, freqs_cis, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/models/llama.py", line 107, in __call__
h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/inference/tinygrad/models/llama.py", line 61, in __call__
xq = xq.reshape(xq.shape[0], xq.shape[1], self.n_heads, self.head_dim)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
ret = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 870, in reshape
return F.Reshape.apply(self, shape=new_shape) if new_shape != self.shape else self
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 37, in apply
ret.lazydata, ret.requires_grad, ret.grad = ctx.forward(*[t.lazydata for t in x], **kwargs), ctx.requires_grad, None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/function.py", line 182, in forward
return x.reshape(shape)
^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/lazy.py", line 214, in reshape
def reshape(self, arg:Tuple[sint, ...]): return self._view(self.st.reshape(arg))
^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/shape/shapetracker.py", line 136, in reshape
if getenv("MERGE_VIEW", 1) and (new_view := self.views[-1].reshape(new_shape)) is not None: return ShapeTracker(self.views[0:-1] + (new_view,))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/tinygrad/shape/view.py", line 278, in reshape
raise ValueError(f"size mismatched, can't reshape {self.shape=} -> {new_shape=}")
ValueError: size mismatched, can't reshape self.shape=(1, 25, 128256, 4096) -> new_shape=(1, 25, 32, 128)
SendTensor tensor shard=Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=18, end_layer=26, n_layers=32) tensor=array([[[ 0.0894 , 0.1609 , -0.4956 , ..., 0.407 , 0.5415 ,
-0.329 ],
[ 0.0894 , 0.1609 , -0.4956 , ..., 0.407 , 0.5415 ,
-0.329 ],
[-0.0381 , -0.07245, 0.1406 , ..., 0.2634 , -0.06097,
0.11346],
...,
[ 0.1456 , 0.05814, 0.02051, ..., -0.1453 , -0.1982 ,
-0.02417],
[ 0.1399 , 0.0958 , -0.0939 , ..., -0.0735 , -0.3967 ,
-0.0407 ],
[ 0.1007 , 0.2139 , -0.1554 , ..., -0.1282 , -0.4033 ,
-0.10376]]], dtype=float16) request_id='d1ac67ac-ff04-4be1-a5e8-5f008a8b689e' result: None
Broadcasting opaque status: request_id='d1ac67ac-ff04-4be1-a5e8-5f008a8b689e' status='{"type": "node_status", "node_id": "88a0ac28-6590-4edb-88ca-5095cb74caba", "status": "end_process_tensor",
"base_shard": {"model_id": "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated", "start_layer": 18, "end_layer": 26, "n_layers": 32}, "shard": {"model_id": "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated",
"start_layer": 0, "end_layer": 20, "n_layers": 32}, "request_id": "d1ac67ac-ff04-4be1-a5e8-5f008a8b689e", "elapsed_time_ns": 8303717050, "result_size": 0}'
update_peers: added=[] removed=[] updated=[] unchanged=[<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x7606b3947920>, <exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at
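A side note on the failing reshape itself: 128256 is the Llama 3 vocab size and 4096 the hidden size (32 heads × 128 = 4096), so the tensor reaching the attention layer appears to carry an extra vocab-sized axis instead of being the expected (1, 25, 4096) hidden state, and the element counts simply can't match. Just the arithmetic on the shapes from the error (not exo code):

```python
# Just the arithmetic behind the error: the source and target shapes do not
# contain the same number of elements, so any reshape between them must fail.
import math

src_shape = (1, 25, 128256, 4096)  # self.shape reported by tinygrad
dst_shape = (1, 25, 32, 128)       # (bsz, seqlen, n_heads, head_dim) expected by attention

print(math.prod(src_shape))  # 13,133,414,400 elements
print(math.prod(dst_shape))  # 102,400 elements
```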
Then the RTX 4000 node fails with:
Error connecting peer [email protected]:52959:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/networking/grpc/grpc_peer_handle.py", line 37, in connect
await self.channel.channel_ready()
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/grpc/aio/_channel.py", line 478, in channel_ready
await self.wait_for_state_change(state)
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/grpc/aio/_channel.py", line 471, in wait_for_state_change
assert await self._channel.watch_connectivity_state(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/channel.pyx.pxi", line 97, in watch_connectivity_state
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/fullofcaffeine/workspace/code/exo/exo/orchestration/standard_node.py", line 312, in connect_with_timeout
await asyncio.wait_for(peer.connect(), timeout)
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
async with timeouts.timeout(timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
raise TimeoutError from exc_val
TimeoutError
Error connecting peer [email protected]:50171:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "/home/fullofcaffeine/workspace/code/exo/exo/networking/grpc/grpc_peer_handle.py", line 37, in connect
await self.channel.channel_ready()
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/grpc/aio/_channel.py", line 478, in channel_ready
await self.wait_for_state_change(state)
File "/home/fullofcaffeine/workspace/code/exo/.venv/lib/python3.12/site-packages/grpc/aio/_channel.py", line 471, in wait_for_state_change
assert await self._channel.watch_connectivity_state(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/channel.pyx.pxi", line 97, in watch_connectivity_state
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/fullofcaffeine/workspace/code/exo/exo/orchestration/standard_node.py", line 312, in connect_with_timeout
await asyncio.wait_for(peer.connect(), timeout)
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
async with timeouts.timeout(timeout):
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
raise TimeoutError from exc_val
TimeoutError
Removing download task for Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=27, end_layer=31, n_layers=32): True
Removing download task for Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=21, end_layer=31, n_layers=32): True
Error in cleanup peers: dictionary changed size during iteration
Traceback (most recent call last):
File "/home/fullofcaffeine/workspace/code/exo/exo/networking/udp/udp_discovery.py", line 174, in task_cleanup_peers
for peer_id, (peer_handle, connected_at, last_seen, prio) in self.known_peers.items():
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration
And finally, here's the log for the M2 node:
(...)
ram used: 16.75 GB, tok_embeddings.weight : 99%|██████████████████████████████████████████████████████████████████████████████████████████████████████▎| 290/292 [00:07<00:00, 36.79it/s]
ram used: 17.81 GB, output.weight : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋| 291/292 [00:07<00:00, 36.82it/s]
ram used: 17.81 GB, freqs_cis : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 292/292 [00:07<00:00, 36.88it/s]
ram used: 17.81 GB, freqs_cis : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 292/292 [00:07<00:00, 36.81it/s]
loaded weights in 7941.89 ms, 8.90 GB loaded at 1.12 GB/s
Error in cleanup peers: dictionary changed size during iteration
Traceback (most recent call last):
File "/Users/fullofcaffeine/workspace/exo/exo/networking/udp/udp_discovery.py", line 174, in task_cleanup_peers
for peer_id, (peer_handle, connected_at, last_seen, prio) in self.known_peers.items():
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration
Timeout sending opaque status to 88a0ac28-6590-4edb-88ca-5095cb74caba
Error in cleanup peers: dictionary changed size during iteration
Traceback (most recent call last):
File "/Users/fullofcaffeine/workspace/exo/exo/networking/udp/udp_discovery.py", line 174, in task_cleanup_peers
for peer_id, (peer_handle, connected_at, last_seen, prio) in self.known_peers.items():
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration
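Not sure if it's related, but the "dictionary changed size during iteration" errors on both nodes look like `task_cleanup_peers` iterating `known_peers` while another task adds or removes peers. The usual way to avoid that is to iterate over a snapshot of the items; a hypothetical sketch (not the actual exo code):

```python
# Hypothetical sketch: iterating over a snapshot (list(...)) of the dict's items
# means concurrent add/remove can no longer raise
# "RuntimeError: dictionary changed size during iteration".
known_peers = {"peer-a": ("handle", 1.0, 2.0, 0), "peer-b": ("handle", 1.0, 0.5, 0)}

for peer_id, (peer_handle, connected_at, last_seen, prio) in list(known_peers.items()):
    if last_seen < connected_at:          # placeholder staleness check
        known_peers.pop(peer_id, None)    # safe: we iterate the snapshot, not the live dict

print(known_peers)  # only 'peer-a' remains
```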
I often only get the first few characters of the LLM's answer and then it stops.
Any ideas why these are failing? All nodes are on commit 2b9dec2. I'm using Python 3.12 on all systems and activating the venv before starting exo. On the Linux systems I start it with `CUDA=1 exo` and on the Mac with `exo --inference-engine tinygrad`.
Thanks in advance!
I don't see anything obviously wrong with your setup. It all looks correct.
The logs suggest there may be some networking issues. The fact that it generates some tokens and then stops also points to a network problem. What network are you running on? What are the bandwidth, latency, and jitter between devices like? Can you try pinging between the machines or running a small network test with iperf3?
Hi Alex! Thanks for the reply.
> I don't see anything obviously wrong with your setup. It all looks correct.
Cool, that's good to hear. As a side question, I assume the "0 TFLOPS" shown for the two Linux nodes isn't too important then?
> What network are you running on?
It's a regular LAN, and the boxes all connect via Wi-Fi (5 GHz). My router is a Synology RT2600ac, and all nodes are on the same Wi-Fi network.
Let me know if you need more info about it or the nodes.
> Can you try pinging between the machines or running a small network test with iperf3?
I didn't know about this tool. I'll try it out and report back the results.
I ran iperf3 as a server on my M2 Mac and then spun up a client on the Quadro RTX 5000 machine:
Mac output:
iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
iperf3: error - unable to receive parameters from client:
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
Accepted connection from 10.0.4.81, port 38314
[ 5] local 10.0.4.39 port 5201 connected to 10.0.4.81 port 38318
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.01 sec 48.1 MBytes 402 Mbits/sec
[ 5] 1.01-2.01 sec 40.9 MBytes 343 Mbits/sec
[ 5] 2.01-3.00 sec 39.9 MBytes 335 Mbits/sec
[ 5] 3.00-4.00 sec 42.1 MBytes 353 Mbits/sec
[ 5] 4.00-5.00 sec 55.4 MBytes 465 Mbits/sec
[ 5] 5.00-6.00 sec 52.8 MBytes 444 Mbits/sec
[ 5] 6.00-7.00 sec 53.8 MBytes 451 Mbits/sec
[ 5] 7.00-8.00 sec 53.5 MBytes 447 Mbits/sec
[ 5] 8.00-9.00 sec 51.2 MBytes 431 Mbits/sec
[ 5] 9.00-10.00 sec 53.8 MBytes 450 Mbits/sec
[ 5] 10.00-10.02 sec 1.38 MBytes 512 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.02 sec 493 MBytes 412 Mbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #3)
-----------------------------------------------------------
Linux output:
Connecting to host 10.0.4.39, port 5201
[ 5] local 10.0.4.81 port 38318 connected to 10.0.4.39 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 49.9 MBytes 418 Mbits/sec 593 2.34 MBytes
[ 5] 1.00-2.00 sec 41.2 MBytes 346 Mbits/sec 352 1.74 MBytes
[ 5] 2.00-3.00 sec 40.0 MBytes 336 Mbits/sec 0 1.84 MBytes
[ 5] 3.00-4.00 sec 42.5 MBytes 357 Mbits/sec 0 1.91 MBytes
[ 5] 4.00-5.00 sec 56.2 MBytes 472 Mbits/sec 37 1.40 MBytes
[ 5] 5.00-6.00 sec 52.5 MBytes 440 Mbits/sec 0 1.48 MBytes
[ 5] 6.00-7.00 sec 53.8 MBytes 451 Mbits/sec 0 1.55 MBytes
[ 5] 7.00-8.00 sec 53.8 MBytes 451 Mbits/sec 0 1.59 MBytes
[ 5] 8.00-9.00 sec 51.2 MBytes 430 Mbits/sec 0 1.62 MBytes
[ 5] 9.00-10.00 sec 53.8 MBytes 451 Mbits/sec 0 1.64 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 495 MBytes 415 Mbits/sec 982 sender
[ 5] 0.00-10.02 sec 493 MBytes 412 Mbits/sec receiver
iperf Done.
Do you see anything off? Let me know if you need more data.
Thanks!
Same problem.