EOFError with guidance 0.2.0 when trying to work through intro notebook
The bug
I'm trying to work through the introductory notebook here and can't get past the very first example.
To Reproduce
from guidance import models
mistral = models.LlamaCpp(
# downloaded from: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q8_0.gguf
"models/mistral-7b-instruct-v0.2.Q8_0.gguf",
n_gpu_layers=-1,
n_ctx=4096,
)
lm = mistral + "Who won the last Kentucky derby and by how much?"
When I run this I get a segmentation fault:
$ python example.py
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
Segmentation fault: 11
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): macOS 15.3.1
- Python version 3.11.11
- Guidance Version (guidance.__version__): 0.2.0
- llama_cpp_python: 0.3.7
- torch: 2.6.0
- transformers: 4.49.0
Please see my update below on other library versions I tried, as I believe there are issues both with guidance and llama-cpp-python.
I am encountering the same issue. If I use llama_cpp directly, as below, it works fine.
from llama_cpp import Llama
llm = Llama(
model_path="/home/dockeruser/.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45",
# n_gpu_layers=-1, # Uncomment to use GPU acceleration
# seed=1337, # Uncomment to set a specific seed
# n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
@nchammas @vintertown Can't remember the reason for this error, but it has something to do with the llama-cpp-python version. Try installing a previous version; for me 0.2.90 worked.
Downgrading llama-cpp-python from 0.3.7 to 0.3.6 avoids the segmentation fault but results in another error:
$ python example.py
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 131, in _main
prepare(preparation_data)
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen runpy>", line 291, in run_path
File "<frozen runpy>", line 98, in _run_module_code
File "<frozen runpy>", line 88, in _run_code
File "...home/.../example.py", line 29, in <module>
model = models.LlamaCpp(
^^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 358, in __init__
engine = LlamaCppEngine(
^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 158, in __init__
super().__init__(
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/_model.py", line 334, in __init__
self.monitor = Monitor(self.metrics)
^^^^^^^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/_model.py", line 2099, in __init__
self.mp_manager = Manager()
^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/context.py", line 57, in Manager
m.start()
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/managers.py", line 563, in start
self._process.start()
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html
Traceback (most recent call last):
File "...home/.../example.py", line 29, in <module>
model = models.LlamaCpp(
^^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 358, in __init__
engine = LlamaCppEngine(
^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 158, in __init__
super().__init__(
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/_model.py", line 334, in __init__
self.monitor = Monitor(self.metrics)
^^^^^^^^^^^^^^^^^^^^^
File "...home/.../.venv/lib/python3.11/site-packages/guidance/models/_model.py", line 2099, in __init__
self.mp_manager = Manager()
^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/context.py", line 57, in Manager
m.start()
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/managers.py", line 567, in start
self._address = reader.recv()
^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "...home/.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
raise EOFError
EOFError
Trying llama-cpp-python at 0.2.90 and again at 0.2.25 results in the same EOFError.
for me 0.2.90 worked
@adityaprakash-work Can you clarify what you mean by "worked"? Did you run exactly the repro script I posted with guidance 0.2.0 and llama-cpp-python 0.2.90?
OK, I got this to work by downgrading both guidance and llama-cpp-python. If I use the latest version of either library, there is a problem.
Works: guidance 0.1.16 + llama_cpp_python 0.3.6
EOFError: guidance 0.2.0 + llama_cpp_python 0.3.6
Segmentation fault: guidance 0.1.16 or 0.2.0 + llama_cpp_python 0.3.7
So I believe there are separate issues with both guidance and llama-cpp-python.
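In the meantime, pinning both packages to the known-good combination above should give a working setup (assuming a plain pip environment that can build llama-cpp-python):
$ pip install 'guidance==0.1.16' 'llama-cpp-python==0.3.6'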
I can replicate this as well; 0.2.0 does not seem to be usable. I can confirm that downgrading to guidance 0.1.16 and llama-cpp-python 0.3.6 allows me to generate tokens.
The issue does not seem to be limited to LlamaCpp; even the Transformers backend fails:
from guidance import models, gen
# path = 'mistral-7b-instruct-v0.2.Q8_0.gguf'
# mistral = models.LlamaCpp(path)
path = 'HuggingFaceTB/SmolLM2-135M-Instruct'
model = models.Transformers(path)
# append text or generations to the model
print(model + f'Do you want a joke or a poem? ' + gen(max_tokens=100))
Yields the following error:
(guidance) λ ~/dottxt/debug/guidance/ python tst.py
/home/cameron/dottxt/debug/guidance/.venv/lib/python3.12/site-packages/guidance/chat.py:80: UserWarning: Chat template {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %} was unable to be loaded directly into guidance.
Defaulting to the ChatML format which may not be optimal for the selected model.
For best results, create and pass in a `guidance.ChatTemplate` subclass for your model.
warnings.warn(
thread '<unnamed>' panicked at toktrie/src/toktree.rs:563:37:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "/home/cameron/dottxt/debug/guidance/tst.py", line 10, in <module>
print(mistral + f'Do you want a joke or a poem? ' + gen(max_tokens=100))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
File "/home/cameron/dottxt/debug/guidance/.venv/lib/python3.12/site-packages/guidance/models/_model.py", line 1207, in __add__
out = lm._run_stateless(value)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/dottxt/debug/guidance/.venv/lib/python3.12/site-packages/guidance/models/_model.py", line 1413, in _run_stateless
for chunk in gen_obj:
^^^^^^^
File "/home/cameron/dottxt/debug/guidance/.venv/lib/python3.12/site-packages/guidance/models/_model.py", line 453, in __call__
mask, ll_response = mask_fut.result()
^^^^^^^^^^^^^^^^^
File "/home/cameron/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/cameron/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/cameron/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/dottxt/debug/guidance/.venv/lib/python3.12/site-packages/guidance/_parser.py", line 97, in compute_mask
mask, ll_response_string = self.ll_interpreter.compute_mask()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
Here's my version info for the transformers error.
The fact that the 0.2.0 release is this broken suggests there is a notable gap in the project's continuous integration tests, since I assume the team did not knowingly publish such a broken release.
Hi All,
Want to apologize here -- we refactored significant chunks of code for v0.2, including the spin-off of a "low level" library. I think some of these errors are caused by our internal dependencies now being out of sync with the parent library, and by poor pinning on our part. We're planning a release this week that should hopefully address all of this. Appreciate your patience while we get this all sorted!
-Harsha
It seems that the EOFError is due to multiprocessing.Manager being used to manage a subprocess that collects certain metrics.
The error message you reported indicates that this is due to (1) not running this code inside an if __name__ == '__main__': guard and (2) the default behavior of Python on macOS, which uses spawn instead of fork to start child processes.
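For what it's worth, the same failure mode can be reproduced without guidance at all. Here is a minimal sketch (the file name spawn_repro.py is just for illustration):

# spawn_repro.py -- minimal sketch, independent of guidance, of the same failure mode.
# Under the "spawn" start method (the macOS default), the child process re-imports
# this module, hits Manager() again while bootstrapping, raises the RuntimeError about
# the missing __main__ guard, and the parent then fails with EOFError on reader.recv().
import multiprocessing

manager = multiprocessing.Manager()  # started at import time, outside any __main__ guard
print(manager.dict())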
@nchammas you can resolve your problem by either using said guard or by changing your multiprocessing start method to fork.
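For concreteness, a minimal sketch of both options, reusing the arguments from your original repro (the model path is whatever you have locally):

# example.py -- sketch of both workarounds, not a definitive fix.
import multiprocessing

from guidance import models

def main():
    # Option 1: keep the default "spawn" start method, but construct the model
    # only under the __main__ guard so the child can re-import this module safely.
    mistral = models.LlamaCpp(
        "models/mistral-7b-instruct-v0.2.Q8_0.gguf",
        n_gpu_layers=-1,
        n_ctx=4096,
    )
    print(mistral + "Who won the last Kentucky derby and by how much?")

if __name__ == "__main__":
    # Option 2: force the "fork" start method before any guidance objects are
    # created. Note that fork has its own caveats on macOS.
    # multiprocessing.set_start_method("fork")
    main()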
@nopdive I'm getting a broken pipe on the monitoring process at system exit either way, so there's definitely still something to fix there. I'd also probably ensure that the monitoring is opt-in in non-interactive sessions rather than opt-out (which may be a more sensible default for notebooks).
The error message you reported indicates that this is due to (1) not running this code inside an if __name__ == '__main__': guard and (2) the default behavior of Python on macOS, which uses spawn instead of fork to start child processes.
I hope that the __name__ guard is just a quick hack for now and not somehow required by design. It would be weird if Guidance did not support running Python scripts without this guard, though I understand such guards are good practice, especially if you are writing a library.
With regards to using fork, I note that the Python docs warn that it is unsafe on macOS:
Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess as macOS system libraries may start threads. See bpo-33725.
In any case, the guard itself doesn't completely resolve the problem. I do get output, but I also get an error alongside it:
# example.py
from guidance import models, lark

grammar_def = r"start: /\w+/"  # raw string so \w is not treated as a string escape

if __name__ == "__main__":
    grammar = lark(grammar_def)
    model = models.Transformers("microsoft/Phi-4-mini-instruct")
    print(model + "This is a word: " + grammar)
$ python example.py
Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:10<00:00, 5.23s/it]
.../.venv/lib/python3.11/site-packages/guidance/chat.py:80: UserWarning: Chat template {% for message in messages %}{% if message['role'] == 'system' and 'tools' in message and message['tools'] is not none %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|tool|>' + message['tools'] + '<|/tool|>' + '<|end|>' }}{% else %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|end|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>' }}{% else %}{{ eos_token }}{% endif %} was unable to be loaded directly into guidance.
Defaulting to the ChatML format which may not be optimal for the selected model.
For best results, create and pass in a `guidance.ChatTemplate` subclass for your model.
warnings.warn(
gpustat is not installed, run `pip install gpustat` to collect GPU stats.
This is a word: 1
Error in monitoring:
---------------------------------------------------------------------------
Traceback (most recent call last):
File ".../.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/managers.py", line 260, in serve_client
self.id_to_local_proxy_obj[ident]
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '10d4b9310'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/managers.py", line 262, in serve_client
raise ke
File ".../.pyenv/versions/3.11.11/lib/python3.11/multiprocessing/managers.py", line 256, in serve_client
obj, exposed, gettypeid = id_to_obj[ident]
~~~~~~~~~^^^^^^^
KeyError: '10d4b9310'
---------------------------------------------------------------------------
Interestingly, if I run the test several times, on occasion I do not get the KeyError, though I still see the "Error in monitoring:" text.
I am running guidance at 22df35a, which is the latest on main as of this moment.
@nchammas glad you're at the very least able to make it run now. We'll look into alternatives to multiprocessing.Manager, as I agree that requiring the guard makes this a very leaky abstraction. @nopdive let's also definitely figure out the monitoring error and, as I said before, probably disable it entirely in non-interactive sessions (at least by default).