lm-evaluation-harness
Clarification on API Endpoint: /v1/completions vs /v1/chat/completions
The relevant code in gguf.py:
response = requests.post(
    f"{self.base_url}/v1/completions", json=request
)
I've been exploring the publicly available OpenAPI documentation and came across a point of confusion regarding the API endpoints. Specifically, I have two questions that I hope can be clarified:
In the OpenAPI documentation, I could not find an endpoint named {url}/v1/completions. I was wondering if this is an oversight in the documentation or if the endpoint is not supported.
Could you please confirm if the correct endpoint for obtaining completions is /v1/chat/completions? I'm asking because this seems to be the relevant endpoint for what I'm trying to achieve, but I want to ensure I'm using the API correctly.
Any clarification or additional information you can provide would be greatly appreciated.
Hi! Could you share a link to the documentation you are referencing?
This GGUF integration was designed around base LMs (i.e., ones not using a chat template), and hence v1/completions was at least the right choice at the time of integration. Regardless of the solution, we should probably document this better for users, and potentially support both options.
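For reference, the two OpenAI-style routes take different payloads, which is why a base-LM integration naturally reaches for /v1/completions. A minimal sketch, assuming a local server at http://localhost:8080 (whether a given backend exposes one route or both depends on the server):

import requests

base_url = "http://localhost:8080"  # assumed local server address

# Text-completion style: a raw prompt string.
completion = requests.post(
    f"{base_url}/v1/completions",
    json={"prompt": "The capital of France is", "max_tokens": 8},
)

# Chat style: role-tagged messages; the server applies its chat template.
chat = requests.post(
    f"{base_url}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "What is the capital of France?"}],
          "max_tokens": 8},
)
print(completion.status_code, chat.status_code)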
The documentation I'm referencing is for llama_server (the llama.cpp server). I tried to rewrite gguf.py as follows:
import logging
import time
import requests
from requests.exceptions import RequestException
from tqdm import tqdm
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
logger = logging.getLogger(__name__)
def get_result(logprobs, context_length):
    is_greedy = True
    continuation_logprobs = 0
    idx = context_length + 1
    for i in range(idx, len(logprobs)):
        current_probs = logprobs[i]["probs"]
        token = logprobs[i]["content"]
        continuation_logprobs += current_probs[0]["prob"]
        top_token = ""
        top_rate = -1
        for t in current_probs:
            if t["prob"] > top_rate:
                top_rate = t["prob"]
                top_token = t["tok_str"]
        # check whether the sampled token is the greedy (highest-probability) one
        if top_token != token:
            is_greedy = False
            break
    return continuation_logprobs, is_greedy
@register_model("gguf", "ggml")
class GGUFLM(LM):
    def __init__(self, base_url=None, max_length=2048, **kwargs):
        super().__init__()
        self.base_url = base_url
        assert self.base_url, "must pass `base_url` to use GGUF LM!"
        self.logprobs = 10
        self.temperature = 0.0
        self.max_length = max_length

    def gguf_completion(
        self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
    ):
        print(context)
        for _ in range(retries):
            try:
                prompt = context
                request = {
                    "prompt": prompt,
                    "n_probs": self.logprobs,
                    "temperature": self.temperature,
                    "n_predict": self.max_length,
                }
                if continuation:
                    prompt += continuation
                    request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
                if stop is not None:
                    request["stop"] = stop
                response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )
                response.raise_for_status()
                return response.json()
            except RequestException as e:
                logger.error(f"RequestException: {e}")
                time.sleep(delay)  # wait before retrying
        else:
            raise Exception(f"Failed to get a valid response after {retries} retries.")
    def loglikelihood(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for context, continuation in tqdm(
            [req.args for req in requests], disable=disable_tqdm
        ):
            response = self.gguf_completion(context=context, continuation=continuation)
            if response and "content" in response and response["content"]:
                choice = response
                logprobs = choice["completion_probabilities"]
                if logprobs:
                    logprob, is_greedy = get_result(logprobs, len(context))
                    res.append((logprob, is_greedy))
                else:
                    logger.warning(
                        "Invalid logprobs data. Expected 'logprobs' to contain 'token_logprobs' list."
                    )
            else:
                logger.error(
                    f"Invalid response for loglikelihood. Response: {response}"
                )
                assert False
        return res
    def generate_until(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for request in tqdm([req.args for req in requests], disable=disable_tqdm):
            inp = request[0]
            request_args = request[1]
            until = request_args.get("until", ["</s>"])
            response = self.gguf_completion(context=inp, stop=until)
            if response and "content" in response and response["content"]:
                choice = response
                if "content" in choice:
                    generated_text = choice["content"].strip()
                    res.append(generated_text)
                else:
                    logger.error(
                        f"Invalid response for greedy_until. Response: {response}"
                    )
                    res.append(None)  # add a default value in case of error
            else:
                logger.error(f"Invalid response for greedy_until. Response: {response}")
                res.append(None)  # add a default value in case of error
        return res

    def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
        raise NotImplementedError(
            "loglikelihood_rolling not yet supported for GGUF models"
        )
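To make the parsing assumptions in this rewrite explicit, here is a toy response in the shape the loglikelihood path above expects ("content" plus "completion_probabilities"); the field names simply mirror the code above and the values are invented:

# Hypothetical response shaped like what the code above tries to parse.
mock_response = {
    "content": " Paris",
    "completion_probabilities": [
        {"content": " Par", "probs": [{"tok_str": " Par", "prob": 0.91},
                                      {"tok_str": " Lon", "prob": 0.03}]},
        {"content": "is", "probs": [{"tok_str": "is", "prob": 0.97},
                                    {"tok_str": "!", "prob": 0.01}]},
    ],
}

# get_result() sums probs[0]["prob"] for every entry after index context_length and
# flags is_greedy=False as soon as a token differs from the top-probability one.
logprob_sum, is_greedy = get_result(
    mock_response["completion_probabilities"], context_length=-1  # -1: keep all entries
)
print(logprob_sum, is_greedy)  # 1.88 True (note: raw probabilities, not log-probs)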
However, the loglikelihood computation may be wrong: running the code below gives an accuracy of only 0.49.
import json

from lm_eval.models.gguf import GGUFLM
from lm_eval import simple_evaluate

lm = GGUFLM(base_url="http://localhost:8080")
# tasks.initialize_tasks()
results = simple_evaluate(model=lm, tasks=["piqa"])

# export everything in `results` except the per-sample data to a JSON file
filtered_results = {key: value for key, value in results.items() if key != "samples"}
json_filtered_results = json.dumps(filtered_results, indent=4)
with open("results.json", "w") as json_file:
    json_file.write(json_filtered_results)
result:
{
    "results": {
        "piqa": {
            "acc,none": 0.4923830250272035,
            "acc_stderr,none": 0.011664470424044978,
            "acc_norm,none": 0.4923830250272035,
            "acc_norm_stderr,none": 0.011664470424044978,
            "alias": "piqa"
        }
    },
    "group_subtasks": {
        "piqa": []
    },
    "configs": {
        "piqa": {
            "task": "piqa",
            "dataset_path": "piqa",
            "training_split": "train",
            "validation_split": "validation",
            "doc_to_text": "Question: {{goal}}\nAnswer:",
            "doc_to_target": "label",
            "doc_to_choice": "{{[sol1, sol2]}}",
            "description": "",
            "target_delimiter": " ",
            "fewshot_delimiter": "\n\n",
            "num_fewshot": 0,
            "metric_list": [
                {
                    "metric": "acc",
                    "aggregation": "mean",
                    "higher_is_better": true
                },
                {
                    "metric": "acc_norm",
                    "aggregation": "mean",
                    "higher_is_better": true
                }
            ],
            "output_type": "multiple_choice",
            "repeats": 1,
            "should_decontaminate": true,
            "doc_to_decontamination_query": "goal",
            "metadata": {
                "version": 1.0
            }
        }
    },
    "versions": {
        "piqa": 1.0
    },
    "n-shot": {
        "piqa": 0
    },
    "config": {
        "model": "GGUFLM",
        "model_args": null,
        "batch_size": null,
        "batch_sizes": [],
        "device": null,
        "use_cache": null,
        "limit": null,
        "bootstrap_iters": 100000,
        "gen_kwargs": null
    },
    "git_hash": "4600d6bf",
    "date": 1711460633.7369168,
"pretty_env_info": "PyTorch version: 2.2.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 20.04.6 LTS (aarch64)\nGCC version: (GCC) 11.2.0\nClang version: Could not collect\nCMake version: version 3.16.3\nLibc version: glibc-2.31\n\nPython version: 3.10.14 (main, Mar 21 2024, 16:18:23) [GCC 11.2.0] (64-bit runtime)\nPython platform: Linux-5.4.0-153-generic-aarch64-with-glibc2.31\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: aarch64\nCPU op-mode(s): 32-bit, 64-bit\nByte Order: Little Endian\nCPU(s): 8\nOn-line CPU(s) list: 0-7\nThread(s) per core: 1\nCore(s) per socket: 8\nSocket(s): 1\nNUMA node(s): 1\nVendor ID: ARM\nModel: 0\nStepping: r0p0\nCPU max MHz: 3000.0000\nCPU min MHz: 3000.0000\nBogoMIPS: 100.00\nL1d cache: 512 KiB\nL1i cache: 512 KiB\nL2 cache: 8 MiB\nL3 cache: 64 MiB\nNUMA node0 CPU(s): 0-7\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; __user pointer sanitization\nVulnerability Spectre v2: Mitigation; CSV2, BHB\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\nFlags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] torch==2.2.1\n[conda] numpy 1.26.4 pypi_0 pypi\n[conda] torch 2.2.1 pypi_0 pypi",
"transformers_version": "4.39.1",
"upper_git_hash": null
}
Hi, I met the same problem as you. Have you solved it?
What model here is being evaluated? I will try to look into this soon.
Qwen-1.8B. I discovered that the original gguf.py script works when the server is started with llama-cpp-python. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?
Try using llama-cpp-python instead of llama.cpp to start the server.
Thanks for your reply, I'll try it later.
I found that the continuation log-probability sum is computed differently in gguf.py and huggingface.py. In huggingface.py, the log-probs are gathered at the corresponding continuation token indices, but in gguf.py the sum is taken over the probabilities of the model's output tokens, without selecting, by index, the probability of each continuation token. Am I right to understand it this way? @haileyschoelkopf
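For comparison, a minimal toy sketch of the HF-style computation described above: take log-softmax over the logits, gather the log-probabilities at the continuation token indices, and sum them (random tensors here, not the harness's actual code):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)              # toy outputs: 6 positions, vocab of 10
token_ids = torch.randint(0, 10, (6,))   # toy token ids for context + continuation
ctx_len = 3                              # first 3 tokens are the context

# Logits at position i score token i+1, so the continuation tokens
# token_ids[ctx_len:] are scored by logits[ctx_len - 1 : -1].
log_probs = F.log_softmax(logits[ctx_len - 1 : -1], dim=-1)
cont_ids = token_ids[ctx_len:]
cont_logprobs = log_probs.gather(1, cont_ids.unsqueeze(-1)).squeeze(-1)

loglikelihood = cont_logprobs.sum().item()                     # sum of per-token log-probs
is_greedy = bool((log_probs.argmax(dim=-1) == cont_ids).all())
print(loglikelihood, is_greedy)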
In the GGUF implementation it should determine, based on the offsets, which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens). I've definitely run this and gotten performance equivalent to HF on a Llama-7B model at the time... what is Qwen-1.8B's PIQA accuracy in the HuggingFace implementation?
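A rough sketch of that offset-based skipping (illustrative only, not the exact gguf.py code): with echoed, OpenAI-style logprobs, each token carries the character offset at which it starts, so tokens whose offset lies inside the context string are skipped before summing:

def continuation_logprob(text_offsets, token_logprobs, context_length):
    """Sum log-probs only for tokens starting at or after the end of the context."""
    idx = 0
    while idx < len(text_offsets) and text_offsets[idx] < context_length:
        idx += 1  # skip tokens that belong to the context
    return sum(token_logprobs[idx:])

# Toy example: the context is 10 characters long; the last two tokens start at
# character offsets 10 and 13, so only their log-probs (-1.3 and -0.7) are summed.
print(continuation_logprob([0, 4, 10, 13], [-0.2, -0.5, -1.3, -0.7], context_length=10))  # -2.0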
Thank you for your assistance. I've tested Qwen-1.8B's accuracy on PIQA and found it to be 0.68, but I'm not sure whether that is right.