Multiple inference providers unconditionally adding BOS token to completion prompts
System Info
Latest llama-stack from main as of this bug report (029e4fc64d9017eed625c927a69e71fff9033727)
🐛 Describe the bug
When calling the completions API, multiple inference providers add the Llama-specific BOS token (`<|begin_of_text|>`) to the front of my prompt.
Here's a simple reproducing test, inlined for context:
```python
import pytest

from llama_stack.apis.inference import CompletionRequest
from llama_stack.providers.utils.inference.prompt_adapter import completion_request_to_prompt


@pytest.mark.asyncio
async def test_foo():
    expected_prompt = "foo"
    request = CompletionRequest(
        model = "a model",
        content = expected_prompt,
    )
    prompt = await completion_request_to_prompt(request)
    assert prompt == expected_prompt
```
Error logs
```
==================================================================== test session starts ====================================================================
platform linux -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /home/bbrownin/src/llama-stack/venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'Linux-6.13.6-100.fc40.x86_64-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'html': '4.1.1', 'langsmith': '0.3.15'}}
rootdir: /home/bbrownin/src/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, metadata-3.1.1, asyncio-0.25.3, html-4.1.1, langsmith-0.3.15
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item

tests/unit/providers/inference/test_prompt_adapter.py::test_foo FAILED

========================================================================= FAILURES ==========================================================================
_________________________________________________________________________ test_foo __________________________________________________________________________

    @pytest.mark.asyncio
    async def test_foo():
        expected_prompt = "foo"
        request = CompletionRequest(
            model = "a model",
            content = expected_prompt,
        )
        prompt = await completion_request_to_prompt(request)
>       assert prompt == expected_prompt
E       AssertionError: assert '<|begin_of_text|>foo' == 'foo'
E
E         - foo
E         + <|begin_of_text|>foo

tests/unit/providers/inference/test_prompt_adapter.py:21: AssertionError
```
Expected behavior
The helper utility `completion_request_to_prompt` should not unconditionally add the Llama BOS token to the start of completion prompts. Alternatively, if it continues to do that, each inference provider that uses it for completion prompts needs to be adjusted to only call it when the model in question is actually a Llama model.
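For illustration only, here's a minimal sketch of the second option. It assumes a hypothetical `is_llama_model()` check (not part of llama-stack) standing in for whatever model-metadata lookup a provider already has, and it assumes `request.content` is a plain string:

```python
# Illustrative sketch only -- not actual llama-stack provider code.
from llama_stack.apis.inference import CompletionRequest
from llama_stack.providers.utils.inference.prompt_adapter import completion_request_to_prompt


def is_llama_model(model_id: str) -> bool:
    # Hypothetical check; a real provider would consult its model registry
    # rather than string-matching on the identifier.
    return "llama" in model_id.lower()


async def build_completion_prompt(request: CompletionRequest) -> str:
    if is_llama_model(request.model):
        # Llama models expect the Llama prompt format, including the
        # <|begin_of_text|> BOS token that the helper currently prepends.
        return await completion_request_to_prompt(request)
    # Non-Llama models get the prompt passed through untouched
    # (assuming request.content is already a plain string here).
    return request.content
```

The first option (making BOS injection opt-in inside the helper itself) would avoid duplicating a check like this in every provider, but it would touch every current caller of the helper.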
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
I believe this is still relevant, although I also haven't seen any other recent user reports of it. So, commenting here to keep this from being automatically closed quite yet until I dig into this more.
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!