mlc-llm
[Question] mlc-llm server cannot return correct logprobs
❓ General Questions
Steps to reproduce the behavior:
mlc_llm serve --model-lib /mnt/data/ehdd1/home/models/mlc/libs/Llama-2-7b-chat-hf-q0f16-O0-cuda.so /mnt/data/ehdd1/home/models/mlc/Llama-2-7b-chat-hf-q0f16-MLC/
python test.py
test.py as following:
import requests
import json

MLC_SERVER_URL = "http://127.0.0.1:8000/v1/completions"

request_payload = {
    "model": "/mnt/data/ehdd1/home/models/mlc/Llama-2-7b-chat-hf-q0f16-MLC/",
    "prompt": "1 + 1 =",
    "max_tokens": 5,
    "logprobs": 5,
    "temperature": 0,
    "top_p": 1,
}

response = requests.post(MLC_SERVER_URL, json=request_payload)

if response.status_code == 200:
    result = response.json()
    print(json.dumps(result, indent=4))
    if "choices" in result:
        for choice in result["choices"]:
            if "logprobs" in choice and choice["logprobs"] is not None:
                print("\n Successfully retrieved logprobs:\n", json.dumps(choice["logprobs"], indent=4))
            else:
                print("\n logprobs is empty, MLC might not have computed logprobs.")
else:
    print(f"\n Request failed: {response.status_code}")
    print(response.text)
Expected behavior
I am using the above script to test the returned logprobs. However, I noticed that when I run mlc_llm serve, the returned logprobs are effectively empty (a single empty token with a logprob of 0.0), while the tokens themselves are generated correctly. This prevents me from using the lm_eval tool to evaluate the model's performance.
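For context, this is roughly how I drive the evaluation; the task name below is just an illustrative example, and the exact lm_eval arguments may differ slightly between versions:

# illustrative lm_eval invocation against the local OpenAI-compatible completions endpoint;
# loglikelihood-based tasks such as lambada_openai depend on the returned logprobs
lm_eval --model local-completions \
  --tasks lambada_openai \
  --model_args model=/mnt/data/ehdd1/home/models/mlc/Llama-2-7b-chat-hf-q0f16-MLC/,base_url=http://127.0.0.1:8000/v1/completions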
Using the same script, I tested with llama-cpp-python by starting its server, and in that case the logprobs were returned correctly. This allowed me to use lm_eval to evaluate the model's performance properly, which shows that neither my script nor my model is the issue. llama-cpp-python also exposes the same OpenAI-compatible API as mlc-llm.
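For reference, the llama-cpp-python comparison server is started along these lines (the GGUF path is a placeholder, not my exact path; depending on the version, the model may also need to be loaded with logits_all enabled for logprobs to be returned):

# hypothetical command for the comparison server; the GGUF path is a placeholder
python -m llama_cpp.server --model <path-to-llama-2-7b-chat.gguf> --port 8000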
Therefore, I am wondering whether there is a bug in mlc-llm that prevents it from returning logprobs correctly.
Output from llama-cpp-python (this is also the expected result, with correct logprobs):
{
    "id": "cmpl-6fad5dd2-0c56-4cf3-80be-580028c487dc",
    "object": "text_completion",
    "created": 1739426316,
    "model": "/mnt/data/ehdd1/home/models/mlc/Llama-2-7b-chat-hf-q0f16-MLC/",
    "choices": [
        {
            "text": " 2\n\n2",
            "index": 0,
            "logprobs": {
                "text_offset": [
                    7,
                    8,
                    9,
                    10,
                    11
                ],
                "token_logprobs": [
                    -0.12357338517904282,
                    -0.09533079713582993,
                    -0.586503267288208,
                    -0.46185949444770813,
                    -1.2555230855941772
                ],
                "tokens": [
                    " ",
                    "2",
                    "\n",
                    "\n",
                    "2"
                ],
                "top_logprobs": [
                    {
                        " ": -0.12357338517904282,
                        " ?": -3.4502124786376953,
                        " ": -3.5839529037475586,
                        "\n": -4.564411163330078,
                        " _": -5.367070198059082
                    },
                    {
                        "2": -0.09533079713582993,
                        "3": -3.1487088203430176,
                        "1": -3.6742844581604004,
                        "4": -5.479968547821045,
                        "0": -6.0776143074035645
                    },
                    {
                        "\n": -0.586503267288208,
                        ",": -2.135464906692505,
                        ".": -2.315242052078247,
                        " ": -3.1089794635772705,
                        " (": -3.291478395462036
                    },
                    {
                        "\n": -0.46185949444770813,
                        "```": -2.412043333053589,
                        " ": -3.3926141262054443,
                        " ": -3.8115532398223877,
                        "(": -4.226720809936523
                    },
                    {
                        "2": -1.2555230855941772,
                        "This": -2.669368267059326,
                        "But": -2.814023494720459,
                        "1": -3.256460666656494,
                        "3": -3.4381399154663086
                    }
                ]
            },
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 7,
        "completion_tokens": 5,
        "total_tokens": 12
    }
}
Successfully retrieved logprobs:
{
    "text_offset": [
        7,
        8,
        9,
        10,
        11
    ],
    "token_logprobs": [
        -0.12357338517904282,
        -0.09533079713582993,
        -0.586503267288208,
        -0.46185949444770813,
        -1.2555230855941772
    ],
    "tokens": [
        " ",
        "2",
        "\n",
        "\n",
        "2"
    ],
    "top_logprobs": [
        {
            " ": -0.12357338517904282,
            " ?": -3.4502124786376953,
            " ": -3.5839529037475586,
            "\n": -4.564411163330078,
            " _": -5.367070198059082
        },
        {
            "2": -0.09533079713582993,
            "3": -3.1487088203430176,
            "1": -3.6742844581604004,
            "4": -5.479968547821045,
            "0": -6.0776143074035645
        },
        {
            "\n": -0.586503267288208,
            ",": -2.135464906692505,
            ".": -2.315242052078247,
            " ": -3.1089794635772705,
            " (": -3.291478395462036
        },
        {
            "\n": -0.46185949444770813,
            "```": -2.412043333053589,
            " ": -3.3926141262054443,
            " ": -3.8115532398223877,
            "(": -4.226720809936523
        },
        {
            "2": -1.2555230855941772,
            "This": -2.669368267059326,
            "But": -2.814023494720459,
            "1": -3.256460666656494,
            "3": -3.4381399154663086
        }
    ]
}
Output from mlc-llm (incorrect):
{
    "id": "cmpl-f5c2dd151fcc4a9dad8efd82d4d1c5d1",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": {
                "text_offset": null,
                "token_logprobs": [
                    0.0
                ],
                "tokens": [
                    ""
                ],
                "top_logprobs": [
                    {
                        "": -23.0259,
                        "<unk>": -23.0259,
                        "<s>": -23.0259,
                        "</s>": -23.0259
                    }
                ]
            },
            "text": "2.\n\n"
        }
    ],
    "created": 1739426790,
    "model": "/mnt/data/ehdd1/home/models/mlc/Llama-2-7b-chat-hf-q0f16-MLC/",
    "object": "text_completion",
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 5,
        "total_tokens": 11,
        "extra": null
    }
}
Successfully retrieved logprobs:
{
    "text_offset": null,
    "token_logprobs": [
        0.0
    ],
    "tokens": [
        ""
    ],
    "top_logprobs": [
        {
            "": -23.0259,
            "<unk>": -23.0259,
            "<s>": -23.0259,
            "</s>": -23.0259
        }
    ]
}
Environment
Platform: CUDA
Operating system: Ubuntu
Device: Nvidia A6000
How you installed MLC-LLM: source