
update gguf backend to use Chat-completion API

Open · falkbene opened this issue 7 months ago

The response structure for the logprobs of the /completion API was changed here: https://github.com/ggml-org/llama.cpp/commit/57bb2c40cd94c5a09f5210ed8264cc93b21c4b7e. Furthermore, the completions API is now considered legacy (https://platform.openai.com/docs/guides/completions). This commit adapts the gguf backend to use the /chat/completions API and handles the new logprobs response structure correctly. It also resolves https://github.com/ggml-org/llama.cpp/issues/12591, where llama-server no longer recognized the echo parameter, since that parameter is not needed anymore with the chat endpoint.
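For reference, a minimal sketch of a chat-completions request with per-token logprobs (not the actual lm-eval code; the base URL, model name, and prompt are placeholders, and it assumes the server exposes an OpenAI-compatible /v1/chat/completions endpoint with the OpenAI chat logprobs schema):

```python
# Minimal sketch, not the harness implementation: query an OpenAI-compatible
# chat endpoint served by llama-server and read per-token logprobs.
import requests

BASE_URL = "http://localhost:8080/v1/chat/completions"  # placeholder URL

payload = {
    "model": "gguf-model",  # placeholder model name
    "messages": [{"role": "user", "content": "The capital of France is"}],
    "max_tokens": 1,
    "logprobs": True,
    "top_logprobs": 5,
    "temperature": 0.0,
}

resp = requests.post(BASE_URL, json=payload, timeout=60).json()

# In the chat schema, logprobs live under choices[0]["logprobs"]["content"]:
# one entry per generated token, each with its own top_logprobs list.
for tok in resp["choices"][0]["logprobs"]["content"]:
    print(tok["token"], tok["logprob"], [t["token"] for t in tok["top_logprobs"]])
```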

falkbene · Mar 28 '25

CLA assistant check
All committers have signed the CLA.

CLAassistant · Mar 28 '25

Hi! We should still keep the completions API as long as GGUF supports it. Otherwise we would have to chat-format the prompt for base models as well.
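To illustrate the concern (a hedged sketch, not code from either project; the prompt text and parameter values are placeholders): with the legacy completions endpoint the prompt string is sent verbatim, while the chat endpoint wraps it in messages and lets the server apply a chat template first, which changes what actually gets scored for a base model.

```python
# Illustrative only: the two request shapes under discussion.

# Legacy completions-style request: the prompt is passed through verbatim,
# which is what base-model (non-chat) loglikelihood evaluation relies on.
completions_payload = {
    "prompt": "The quick brown fox jumps over the lazy dog",
    "logprobs": 5,
    "temperature": 0.0,
}

# Chat-completions request: the text is wrapped in messages, and the server
# applies its chat template before tokenization, so the scored text differs.
chat_payload = {
    "messages": [
        {"role": "user", "content": "The quick brown fox jumps over the lazy dog"}
    ],
    "logprobs": True,
    "top_logprobs": 5,
    "temperature": 0.0,
}
```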

baberabb · Apr 04 '25

There is an issue with the current implementation, as I pointed out in the llama.cpp issue linked above. A llama-server built from the newest llama.cpp no longer supports the echo parameter, which the lm-eval gguf model file I modified still relies on. Furthermore, the logprobs response structure expected in lm_eval/models/gguf.py was also changed by a llama.cpp update (see the commit linked above). So the current gguf implementation of LM-Evaluation-Harness throws errors when I use it. My edits fix that, at least for the gguf file. We could also keep using the completions API, but we would need to adapt the expected response structure.
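One way to cope with either layout (a sketch only; it is not taken from lm_eval/models/gguf.py, and the fallback key names are assumptions based on the OpenAI legacy completions format rather than quoted from llama.cpp):

```python
# Illustrative helper: tolerate both an older flat logprobs layout and an
# OpenAI chat-style per-token layout when reading a server response.
def extract_token_logprobs(choice: dict) -> list[float]:
    lp = choice.get("logprobs") or {}
    # Chat-style layout: {"content": [{"token": ..., "logprob": ...}, ...]}
    if isinstance(lp.get("content"), list):
        return [t["logprob"] for t in lp["content"]]
    # Legacy completions-style layout: {"token_logprobs": [...]}
    if isinstance(lp.get("token_logprobs"), list):
        return lp["token_logprobs"]
    raise ValueError("Unrecognized logprobs layout in server response")
```

Something along these lines would let the harness keep the completions endpoint while tolerating the newer response shape.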

falkbene · Apr 06 '25

I think we could still use the completions API, but we would have to adapt to the response coming from the server, since its structure has changed.

falkbene · Apr 11 '25