RuntimeError: Wait timeout: 10000 ms (local run)
```
Traceback (most recent call last):
  File "/home/ffamax/exo/exo/api/chatgpt_api.py", line 273, in handle_post_chat_completions
    await asyncio.wait_for(self.node.process_prompt(shard, prompt, image_str, request_id=request_id), timeout=self.response_timeout)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/ffamax/exo/exo/orchestration/standard_node.py", line 98, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/exo/exo/orchestration/standard_node.py", line 134, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 59, in infer_prompt
    await self.ensure_shard(shard)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 97, in ensure_shard
    self.model = await asyncio.get_event_loop().run_in_executor(self.executor, build_transformer, model_path, shard, "8B" if "8b" in shard.model_id.lower() else "70B")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 48, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False) # consume=True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 174, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 140, in __call__
    self.copy(dest, src)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 135, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/device.py", line 114, in as_buffer
    return self.copyout(memoryview(bytearray(self.nbytes)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/device.py", line 125, in copyout
    self.allocator.copyout(mv, self._buf)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/device.py", line 657, in copyout
    self.device.synchronize()
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/device.py", line 519, in synchronize
    self.timeline_signal.wait(self.timeline_value - 1)
  File "/home/ffamax/miniconda3/envs/.venv.py3.12/lib/python3.12/site-packages/tinygrad/device.py", line 424, in wait
    raise RuntimeError(f"Wait timeout: {timeout} ms! (the signal is not set to {value}, but {self.value})")
RuntimeError: Wait timeout: 10000 ms! (the signal is not set to 19, but 0)
Deregister callback_id='chatgpt-api-wait-response-b71dd1bf-c1f7-4ea5-a626-1ddd6febcaf1' deregistered_callback=None
```
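For reference, the request that triggers this is an ordinary chat-completion call against the endpoint exo serves. A minimal stdlib-only sketch (the port and the `llama-3.1-8b` model id are taken from the CLI run below; adjust if yours differ):

```python
# Hypothetical repro for the API path above. Assumes exo is serving its
# ChatGPT-compatible API on 127.0.0.1:8000 with model id "llama-3.1-8b",
# as in the CLI session below.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps({
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": "hi"}],
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Generous client-side timeout; the failure above happens server-side.
with urllib.request.urlopen(req, timeout=120) as resp:
    print(resp.read().decode("utf-8"))
```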
Prompting from the CLI hits the same error:
```
(.venv.py3.12) ffamax@srv4090:~/exo$ DEBUG=2 SUPPORT_BF16=0 exo run llama-3.1-8b --prompt "hi"
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
  _____  _____
 / _ \ \/ / _ \
|  __/>  < (_) |
 \___/_/\_\___/
Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Trying to find available port port=55797
[63156, 51540, 55328, 63288, 52742, 58298, 60339, 50307]
Using available port: 55797
Retrieved existing node ID: 434ccf86-78b3-40e9-9b07-29954834f823
Chat interface started:
- http://10.1.3.172:8000
- http://127.0.0.1:8000
ChatGPT API endpoint served at:
- http://10.1.3.172:8000/v1/chat/completions
- http://127.0.0.1:8000/v1/chat/completions
tinygrad Device.DEFAULT='NV'
NVIDIA device gpu_name='NVIDIA GEFORCE RTX 3060' gpu_memory_info=<pynvml.c_nvmlMemory_t object at 0x7f749a010650>
Server started, listening on 0.0.0.0:55797
tinygrad Device.DEFAULT='NV'
NVIDIA device gpu_name='NVIDIA GEFORCE RTX 3060' gpu_memory_info=<pynvml.c_nvmlMemory_t object at 0x7f7499eb02d0>
update_peers: added=[] removed=[] updated=[] unchanged=[] to_disconnect=[] to_connect=[]
Collecting topology max_depth=4 visited=set()
Collected topology: Topology(Nodes: {434ccf86-78b3-40e9-9b07-29954834f823: Model: Linux Box (NVIDIA GEFORCE RTX 3060). Chip: NVIDIA GEFORCE RTX 3060. Memory: 12288MB. Flops: fp32: 13.00 TFLOPS, fp16: 26.00 TFLOPS, int8: 52.00 TFLOPS}, Edges: {})
Checking if local path exists to load tokenizer from local local_path=PosixPath('/home/ffamax/.cache/huggingface/hub/models--mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated/snapshots/368c8ed94ce4c986e7b9ca5c159651ef753908ce')
Resolving tokenizer for model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated' from local_path=PosixPath('/home/ffamax/.cache/huggingface/hub/models--mlabonne--Meta-Llama-3.1-8B-Instruct-abliterated/snapshots/368c8ed94ce4c986e7b9ca5c159651ef753908ce')
Processing prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
At first glance, this looks like it might be an issue in tinygrad itself rather than in exo.
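One way to check would be to take exo out of the loop and drive the same copy/synchronize path directly. A minimal sketch, assuming tinygrad and numpy are importable from the same venv and NV stays the default backend; if this alone hits the 10000 ms wait timeout, the bug is below exo:

```python
# Isolation sketch: exercise the host->device copy plus synchronize path that
# load_state_dict walks through above, with no exo involved.
# Run with the same environment, e.g.: NV=1 DEBUG=2 python repro.py
import numpy as np
from tinygrad import Tensor

a = Tensor(np.ones((4096, 4096), dtype=np.float32))  # host-side buffer
t = (a + 1).realize()   # forces the copyin to the NV device plus one kernel
print(t.numpy()[0, 0])  # .numpy() triggers copyout -> device.synchronize(), where the wait fired
```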