tinygrad inference engine fails with BEAM=1 due to not running on main thread
This only happens with BEAM=1; BEAM=0, BEAM=2, and BEAM=3 all work fine.
It happens because exo runs tinygrad inference on a thread other than the main thread.
Example command to reproduce: DEBUG=6 BEAM=1 python3 main.py --inference-engine tinygrad --run-model llama-3.1-8b
Error:
Error processing prompt: signal only works in main thread of the main interpreter
Traceback (most recent call last):
  File "/Users/alex/exo/main.py", line 158, in run_model_cli
    await node.process_prompt(shard, prompt, None, request_id=request_id)
  File "/Users/alex/exo/exo/orchestration/standard_node.py", line 98, in process_prompt
    resp = await self._process_prompt(base_shard, prompt, image_str, request_id, inference_state)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/orchestration/standard_node.py", line 134, in _process_prompt
    result, inference_state, is_finished = await self.inference_engine.infer_prompt(request_id, shard, prompt, image_str, inference_state=inference_state)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/tinygrad/inference.py", line 67, in infer_prompt
    h = await asyncio.get_event_loop().run_in_executor(self.executor, lambda: self.model(Tensor([toks]), start_pos, TEMPERATURE).realize())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/tinygrad/inference.py", line 67, in <lambda>
    h = await asyncio.get_event_loop().run_in_executor(self.executor, lambda: self.model(Tensor([toks]), start_pos, TEMPERATURE).realize())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/tinygrad/models/llama.py", line 214, in __call__
    return self.forward(tokens, start_pos, temperature, top_k, top_p, alpha_f, alpha_p)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/exo/inference/tinygrad/models/llama.py", line 193, in forward
    mask = Tensor.full((1, 1, seqlen, start_pos + seqlen), float("-100000000"), dtype=x.dtype, device=x.device).triu(start_pos + 1).realize() if seqlen > 1 else None
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 3414, in _wrapper
    ret = fn(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/tensor.py", line 208, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 221, in run_schedule
    for ei in lower_schedule(schedule):
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 214, in lower_schedule
    raise e
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 208, in lower_schedule
    try: yield lower_schedule_item(si)
    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 192, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 157, in get_runner
    prg: Program = get_kernel(Device[dname].renderer, ast).to_program()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/realize.py", line 31, in get_kernel
    k = beam_search(kb, rawbufs, BEAM.value, bool(getenv("BEAM_ESTIMATE", 1)))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/search.py", line 151, in beam_search
    for i,proc in (map(_compile_fn, enumerate(acted_lins)) if beam_pool is None else beam_pool.imap_unordered(_compile_fn, enumerate(acted_lins))):
  File "/Users/alex/exo/.venv/lib/python3.12/site-packages/tinygrad/engine/search.py", line 60, in _try_compile_linearized_w_idx
    signal.signal(signal.SIGALRM, timeout_handler)
  File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: signal only works in main thread of the main interpreter
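The root cause can be reproduced without tinygrad at all: Python only allows `signal.signal()` to be called from the main thread of the main interpreter, so installing a handler from an executor thread (as exo's `run_in_executor` call ends up doing inside tinygrad's beam search) raises the same `ValueError`. A minimal sketch (SIGALRM is POSIX-only):

```python
import asyncio
import signal


def handler(signum, frame):
    pass


def set_alarm():
    # signal.signal() may only be called from the main thread of the
    # main interpreter; calling it from an executor (worker) thread
    # raises ValueError, which is what beam search hits here.
    signal.signal(signal.SIGALRM, handler)


async def main():
    loop = asyncio.get_running_loop()
    try:
        # None = default ThreadPoolExecutor, i.e. a non-main thread
        await loop.run_in_executor(None, set_alarm)
    except ValueError as e:
        print(e)  # -> signal only works in main thread of the main interpreter


asyncio.run(main())
```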
More generally, this looks like a race condition that could also surface at other BEAM levels, because of the way tinygrad uses signals: its beam search installs a SIGALRM handler via signal.signal() to time out slow kernel compiles, and Python only permits that from the main thread.
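One thread-safe way to get equivalent timeout behaviour is to enforce the deadline from the caller with `Future.result(timeout=...)` instead of installing a SIGALRM handler in the worker. This is a hypothetical sketch, not tinygrad's actual fix; `compile_kernel` and the timeout value are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def compile_kernel():
    # Placeholder for the compile step that tinygrad wraps with SIGALRM.
    return "compiled"


def compile_with_timeout(timeout=10.0):
    # The timeout is enforced by the caller via Future.result(), which
    # works from any thread -- no signal handler needed.
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(compile_kernel)
        try:
            return fut.result(timeout=timeout)
        except TimeoutError:
            return None  # compile took too long


print(compile_with_timeout())  # -> compiled
```

The trade-off is that, unlike SIGALRM, a timed-out worker thread is not forcibly interrupted; it keeps running in the background until it finishes on its own.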