lm-evaluation-harness
Clarification on API Endpoint: /v1/completions vs /v1/chat/completions
The relevant code in gguf.py:
response = requests.post(
    f"{self.base_url}/v1/completions", json=request
)
I've been exploring the publicly available OpenAPI documentation and came across a point of confusion regarding the API endpoints. Specifically, I have two questions that I hope can be clarified:
In the OpenAPI documentation, I could not find an endpoint named {url}/v1/completions. I was wondering if this is an oversight in the documentation or if the endpoint is not supported.
Could you please confirm if the correct endpoint for obtaining completions is /v1/chat/completions? I'm asking because this seems to be the relevant endpoint for what I'm trying to achieve, but I want to ensure I'm using the API correctly.
Any clarification or additional information you can provide would be greatly appreciated.
Hi! Could you share a link to the documentation you are referencing?
This GGUF integration was designed around base LMs (i.e., ones not using a chat template), and hence v1/completions was at least the right choice at the time of integration. Regardless of the solution, we should probably document this better for users, and potentially support both options.
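For reference, the two OpenAI-style routes take different payloads, which is why a base-LM integration naturally reaches for /v1/completions. A minimal sketch, assuming a local server at http://localhost:8080 (whether a given backend exposes one route or both depends on the server):

import requests

base_url = "http://localhost:8080"  # assumed local server address

# Text-completion style: a raw prompt string.
completion = requests.post(
    f"{base_url}/v1/completions",
    json={"prompt": "The capital of France is", "max_tokens": 8},
)

# Chat style: role-tagged messages; the server applies its chat template.
chat = requests.post(
    f"{base_url}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "What is the capital of France?"}],
          "max_tokens": 8},
)
print(completion.status_code, chat.status_code)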
The documentation I'm referencing is for llama_server (the llama.cpp server). I tried to rewrite gguf.py as follows:
import logging
import time
import requests
from requests.exceptions import RequestException
from tqdm import tqdm
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
logger = logging.getLogger(__name__)
def get_result(logprobs, context_length):
    is_greedy = True
    continuation_logprobs = 0
    idx = context_length + 1
    for i in range(idx, len(logprobs)):
        current_probs = logprobs[i]["probs"]
        token = logprobs[i]["content"]
        continuation_logprobs += current_probs[0]["prob"]
        top_token = ""
        top_rate = -1
        for t in current_probs:
            if t["prob"] > top_rate:
                top_rate = t["prob"]
                top_token = t["tok_str"]
        # check whether the sampled token is the greedy (highest-probability) one
        if top_token != token:
            is_greedy = False
            break
    return continuation_logprobs, is_greedy
@register_model("gguf", "ggml")
class GGUFLM(LM):
    def __init__(self, base_url=None, max_length=2048, **kwargs):
        super().__init__()
        self.base_url = base_url
        assert self.base_url, "must pass `base_url` to use GGUF LM!"
        self.logprobs = 10
        self.temperature = 0.0
        self.max_length = max_length

    def gguf_completion(
        self, context, continuation=None, stop=None, retries=3, delay=5, **kwargs
    ):
        print(context)
        for _ in range(retries):
            try:
                prompt = context
                request = {
                    "prompt": prompt,
                    "n_probs": self.logprobs,
                    "temperature": self.temperature,
                    "n_predict": self.max_length,
                }
                if continuation:
                    prompt += continuation
                    request.update({"prompt": prompt, "max_tokens": 1, "echo": True})
                if stop is not None:
                    request["stop"] = stop
                response = requests.post(
                    f"{self.base_url}/v1/completions", json=request
                )
                response.raise_for_status()
                return response.json()
            except RequestException as e:
                logger.error(f"RequestException: {e}")
                time.sleep(delay)  # wait before retrying
        else:
            raise Exception(f"Failed to get a valid response after {retries} retries.")
    def loglikelihood(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for context, continuation in tqdm(
            [req.args for req in requests], disable=disable_tqdm
        ):
            response = self.gguf_completion(context=context, continuation=continuation)
            if response and "content" in response and response["content"]:
                choice = response
                logprobs = choice["completion_probabilities"]
                if logprobs:
                    logprob, is_greedy = get_result(logprobs, len(context))
                    res.append((logprob, is_greedy))
                else:
                    logger.warning(
                        "Invalid logprobs data. Expected 'logprobs' to contain 'token_logprobs' list."
                    )
            else:
                logger.error(
                    f"Invalid response for loglikelihood. Response: {response}"
                )
                assert False
        return res
    def generate_until(self, requests, disable_tqdm: bool = False):
        if not requests:
            return []
        res = []
        for request in tqdm([req.args for req in requests], disable=disable_tqdm):
            inp = request[0]
            request_args = request[1]
            until = request_args.get("until", ["</s>"])
            response = self.gguf_completion(context=inp, stop=until)
            if response and "content" in response and response["content"]:
                choice = response
                if "content" in choice:
                    generated_text = choice["content"].strip()
                    res.append(generated_text)
                else:
                    logger.error(
                        f"Invalid response for greedy_until. Response: {response}"
                    )
                    res.append(None)  # add a default value in case of error
            else:
                logger.error(f"Invalid response for greedy_until. Response: {response}")
                res.append(None)  # add a default value in case of error
        return res

    def loglikelihood_rolling(self, requests, disable_tqdm: bool = False):
        raise NotImplementedError(
            "loglikelihood_rolling not yet supported for GGUF models"
        )
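To make the parsing assumptions in this rewrite explicit, here is a toy response in the shape the loglikelihood path above expects ("content" plus "completion_probabilities"); the field names simply mirror the code above and the values are invented:

# Hypothetical response shaped like what the code above tries to parse.
mock_response = {
    "content": " Paris",
    "completion_probabilities": [
        {"content": " Par", "probs": [{"tok_str": " Par", "prob": 0.91},
                                      {"tok_str": " Lon", "prob": 0.03}]},
        {"content": "is", "probs": [{"tok_str": "is", "prob": 0.97},
                                    {"tok_str": "!", "prob": 0.01}]},
    ],
}

# get_result() sums probs[0]["prob"] for every entry after index context_length and
# flags is_greedy=False as soon as a token differs from the top-probability one.
logprob_sum, is_greedy = get_result(
    mock_response["completion_probabilities"], context_length=-1  # -1: keep all entries
)
print(logprob_sum, is_greedy)  # 1.88 True (note: raw probabilities, not log-probs)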
However, the loglikelihood computation may be wrong: running the code below gives an accuracy of only 0.49.
import json

from lm_eval.models.gguf import GGUFLM
from lm_eval import simple_evaluate

lm = GGUFLM(base_url="http://localhost:8080")
# tasks.initialize_tasks()
results = simple_evaluate(model=lm, tasks=["piqa"])

# export everything in `results` except the per-sample data to a JSON file
filtered_results = {key: value for key, value in results.items() if key != "samples"}
json_filtered_results = json.dumps(filtered_results, indent=4)
with open("results.json", "w") as json_file:
    json_file.write(json_filtered_results)
result:
{
    "results": {
        "piqa": {
            "acc,none": 0.4923830250272035,
            "acc_stderr,none": 0.011664470424044978,
            "acc_norm,none": 0.4923830250272035,
            "acc_norm_stderr,none": 0.011664470424044978,
            "alias": "piqa"
        }
    },
    "group_subtasks": {
        "piqa": []
    },
    "configs": {
        "piqa": {
            "task": "piqa",
            "dataset_path": "piqa",
            "training_split": "train",
            "validation_split": "validation",
            "doc_to_text": "Question: {{goal}}\nAnswer:",
            "doc_to_target": "label",
            "doc_to_choice": "{{[sol1, sol2]}}",
            "description": "",
            "target_delimiter": " ",
            "fewshot_delimiter": "\n\n",
            "num_fewshot": 0,
            "metric_list": [
                {
                    "metric": "acc",
                    "aggregation": "mean",
                    "higher_is_better": true
                },
                {
                    "metric": "acc_norm",
                    "aggregation": "mean",
                    "higher_is_better": true
                }
            ],
            "output_type": "multiple_choice",
            "repeats": 1,
            "should_decontaminate": true,
            "doc_to_decontamination_query": "goal",
            "metadata": {
                "version": 1.0
            }
        }
    },
    "versions": {
        "piqa": 1.0
    },
    "n-shot": {
        "piqa": 0
    },
    "config": {
        "model": "GGUFLM",
        "model_args": null,
        "batch_size": null,
        "batch_sizes": [],
        "device": null,
        "use_cache": null,
        "limit": null,
        "bootstrap_iters": 100000,
        "gen_kwargs": null
    },
    "git_hash": "4600d6bf",
    "date": 1711460633.7369168,
"pretty_env_info": "PyTorch version: 2.2.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 20.04.6 LTS (aarch64)\nGCC version: (GCC) 11.2.0\nClang version: Could not collect\nCMake version: version 3.16.3\nLibc version: glibc-2.31\n\nPython version: 3.10.14 (main, Mar 21 2024, 16:18:23) [GCC 11.2.0] (64-bit runtime)\nPython platform: Linux-5.4.0-153-generic-aarch64-with-glibc2.31\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: aarch64\nCPU op-mode(s): 32-bit, 64-bit\nByte Order: Little Endian\nCPU(s): 8\nOn-line CPU(s) list: 0-7\nThread(s) per core: 1\nCore(s) per socket: 8\nSocket(s): 1\nNUMA node(s): 1\nVendor ID: ARM\nModel: 0\nStepping: r0p0\nCPU max MHz: 3000.0000\nCPU min MHz: 3000.0000\nBogoMIPS: 100.00\nL1d cache: 512 KiB\nL1i cache: 512 KiB\nL2 cache: 8 MiB\nL3 cache: 64 MiB\nNUMA node0 CPU(s): 0-7\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1: Mitigation; __user pointer sanitization\nVulnerability Spectre v2: Mitigation; CSV2, BHB\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\nFlags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] torch==2.2.1\n[conda] numpy 1.26.4 pypi_0 pypi\n[conda] torch 2.2.1 pypi_0 pypi",
"transformers_version": "4.39.1",
"upper_git_hash": null
}
Hi, I met the same problem as you. Have you solved it?
What model here is being evaluated? I will try to look into this soon.
Qwen-1.8B. I discovered that the original gguf.py script works when the server is started with llama-cpp-python. However, the accuracy (acc,none) is only 0.68, which might indicate an error. Could you test it to confirm?
Try using llama-cpp-python instead of llama.cpp to start the server.
Thanks for your reply, I'll try it later.
I found that the continuation log-probability sum is computed differently in gguf.py and huggingface.py. In huggingface.py, the log-probs are gathered at the corresponding continuation token indices, but in gguf.py the sum is taken over the probabilities of the model's output tokens, without selecting, by index, the probability of each continuation token. Am I right to understand it this way? @haileyschoelkopf
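For comparison, a minimal toy sketch of the HF-style computation described above: take log-softmax over the logits, gather the log-probabilities at the continuation token indices, and sum them (random tensors here, not the harness's actual code):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)              # toy outputs: 6 positions, vocab of 10
token_ids = torch.randint(0, 10, (6,))   # toy token ids for context + continuation
ctx_len = 3                              # first 3 tokens are the context

# Logits at position i score token i+1, so the continuation tokens
# token_ids[ctx_len:] are scored by logits[ctx_len - 1 : -1].
log_probs = F.log_softmax(logits[ctx_len - 1 : -1], dim=-1)
cont_ids = token_ids[ctx_len:]
cont_logprobs = log_probs.gather(1, cont_ids.unsqueeze(-1)).squeeze(-1)

loglikelihood = cont_logprobs.sum().item()                     # sum of per-token log-probs
is_greedy = bool((log_probs.argmax(dim=-1) == cont_ids).all())
print(loglikelihood, is_greedy)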
In the GGUF implementation it should determine, based on the offsets, which tokens are part of the continuation and which are not (the while loop in the screenshot skips any context tokens). I've definitely run this and gotten performance equivalent to HF on a Llama-7B model at the time... what is Qwen-1.8B's PIQA accuracy in the HuggingFace implementation?
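A rough sketch of that offset-based skipping (illustrative only, not the exact gguf.py code): with echoed, OpenAI-style logprobs, each token carries the character offset at which it starts, so tokens whose offset lies inside the context string are skipped before summing:

def continuation_logprob(text_offsets, token_logprobs, context_length):
    """Sum log-probs only for tokens starting at or after the end of the context."""
    idx = 0
    while idx < len(text_offsets) and text_offsets[idx] < context_length:
        idx += 1  # skip tokens that belong to the context
    return sum(token_logprobs[idx:])

# Toy example: the context is 10 characters long; the last two tokens start at
# character offsets 10 and 13, so only their log-probs (-1.3 and -0.7) are summed.
print(continuation_logprob([0, 4, 10, 13], [-0.2, -0.5, -1.3, -0.7], context_length=10))  # -2.0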
Thank you for your assistance. I've tested Qwen-1.8B's accuracy on PIQA and found it to be 0.68, but I'm not sure whether that is right.