llama.cpp
server: init functional tests
Motivation
Tests were listed in #4216 as an improvement request.
The idea is to ensure all server routes are working properly, using the Gherkin language to define test cases following a BDD approach. It is designed to be human-readable and describes use cases relating to a software system.
Example
```gherkin
@llama.cpp
Scenario: Multi users
  Given a prompt:
    """
    Write a very long story about AI.
    """
  And a prompt:
    """
    Write another very long music lyrics.
    """
  And 32 max tokens to predict
  Given concurrent completion requests
  Then the server is busy
  And all slots are busy
  Then the server is idle
  And all slots are idle
  Then all prompts are predicted
```
Proposed changes
A CI workflow is triggered which builds and starts the server in the background, then test scenarios are launched with Python.
A very small model is used to quickly generate responses, and a fixed seed is set to ensure reproducibility.
The Gherkin glue is written in Python using behave.
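To make the wiring concrete, here is a minimal sketch of what behave glue for the scenario above could look like. The step names mirror the example, but the endpoint usage and assertions are illustrative assumptions, not the PR's actual code; the concurrent-request step is sketched separately after the status checklist further down.

```python
# Hypothetical behave glue for the scenario above (illustrative only).
from behave import step

@step('a prompt:')
def step_a_prompt(context):
    # context.text holds the multi-line docstring ("""...""") from the feature file
    if not hasattr(context, 'prompts'):
        context.prompts = []
    context.prompts.append(context.text)

@step('{n_predict:d} max tokens to predict')
def step_n_predict(context, n_predict):
    context.n_predict = n_predict

@step('all prompts are predicted')
def step_all_prompts_predicted(context):
    # context.responses is assumed to be filled by the concurrent-request step
    for response in context.responses:
        assert response.status_code == 200
        assert len(response.json()['content']) > 0
```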
Restriction
This is not designed to assess performance of the server.
Expected scenarios (from @ngxson's comment):
- [x] health and slots endpoints
- [x] completion endpoint
- [x] OAI-compatible chat completion requests with and without streaming
- [x] passing multi-user scenario
- [x] multi-user scenario on the OAI-compatible endpoint with streaming
- [x] multi-user scenario where the total number of tokens to predict exceeds the KV cache size, fix to be confirmed: #3969
- [x] slots shifting and continuous batching
- [x] embeddings endpoint
- [ ] embeddings endpoint with image
- [ ] multi user embedding endpoint: #5655
- [x] OpenAI-compatible embeddings API
- [x] tokenize endpoint
- [ ] infill endpoint
- [x] CORS and api key scenario @Azeirah
- [ ] Upon receiving incomplete unicode, the JSON converter crashes the whole process. @ngxson
An example of a passing GitHub workflow can be found here.
Reproduced issues
- #3969
- #3287
Status
Reverting to draft, as it will be more flexible to start the server with custom options for each scenario.
Issues/enhancements to be fixed before merging:
- [ ] Use a different CMake build config of the server to run the tests, for example with `-DLLAMA_SANITIZE_THREAD=ON`
- [x] start the server within the scenario background instead of starting it manually to pass custom parameters
- [x] make the server binary path configurable
- [x] fix the `/slots` and `/health` endpoints to properly access slots data over the `queue_tasks`: #5634
- [x] fix `slots[].state` in the health endpoint, which may be incoherent with the total idle slots under a race condition
- [x] change the ci build trigger
- [x] use asyncio and aiohttp to trigger concurrent HTTP requests (see the sketch after this list)
- [ ] fix the async OpenAI HTTP client with streaming
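As referenced in the checklist above, here is a hedged sketch of how asyncio and aiohttp could trigger concurrent completion requests against the server. The port, endpoint, and payload fields are assumptions for illustration, not the PR's actual test code.

```python
# Illustrative only: fire several /completion requests concurrently.
import asyncio
import aiohttp

BASE_URL = 'http://localhost:8080'  # assumed server address

async def post_completion(session, prompt, n_predict=32):
    async with session.post(f'{BASE_URL}/completion',
                            json={'prompt': prompt, 'n_predict': n_predict}) as resp:
        assert resp.status == 200
        return await resp.json()

async def run_concurrent(prompts):
    async with aiohttp.ClientSession() as session:
        # asyncio.gather() schedules all requests at once, so the server's
        # slots are exercised concurrently rather than one after another
        return await asyncio.gather(*(post_completion(session, p) for p in prompts))

if __name__ == '__main__':
    results = asyncio.run(run_concurrent([
        'Write a very long story about AI.',
        'Write another very long music lyrics.',
    ]))
    for r in results:
        print(r.get('content', ''))
```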
I'm quite out of my depth here, but if you can figure out a way to add server tests it would be awesome. I've sent you a collaborator invite.
@RafaAguilar @ngxson As you were part of the tests-related discussion, do you feel OK with the proposed approach here? If so, I will continue with asynchronous requests and the multi-user scenario.
Great idea, thanks for starting this PR. Some suggestions:
- Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
- Would be nice if we have a .sh script that runs `server -m ...` and `python -m behave` all at once. It's again to be easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
- ~~Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on hf is currently `tinyllama-2-1b-miniguanaco.Q2_K.gguf`~~ Tried finetuning `ahxt/LiteLlama-460M-1T` but the result is unusable.
Excellent! I'll be reading this PR today and see if I can add a test or help out in some way :)
Would it be possible to train a nonsensical 1M-param model? It should be really cheap and fast even on commodity hardware. These tests aren't meant to interact in any meaningful way with the output anyway.
I'm mentioning this because I see the trial run ran for 15 minutes for just two features with 3 scenarios each! Imagine the time needed to run 20-30 tests!
@Azeirah Yes it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on `n_predict` to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to do unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from zero.
Fair enough, I think it should be doable to make a model that behaves well enough. Potentially it could be trained explicitly to bias EOS whaha. I agree it would be a fun exercise, unfortunately I have a 7900xtx and I believe it cannot be used to train :(
In addition to that, we of course have no clue what kind of hardware these tests will be run on, but if it's a virtual core on a Xeon or some other, maybe we can try compiling OpenBLAS? I'm not sure if it'd even be worth investigating, depending on the speedup and the variety of weird hardware you could get on GitHub Actions. No clue what kind of control over the underlying (virtualised) hardware you'd get there.
Other than that, I think it's fine that the tests are in separate files. It's kinda just how behave is meant to be used, each feature is one file. Different related scenarios belong to one feature. I'm somewhat familiar with BDD myself since I use a loosely inspired variant at work, do you think BDD is unclear to some people? I could write a short readme explaining it.
Also, one case that I have never tested before is invalid unicode.
In my personal project (which uses llama.h), on receiving responses via `llama_token_to_piece`, I pass them to `nlohmann/json` to convert them to a JSON string. That's the same thing we're using in the server example. Upon receiving incomplete unicode, the JSON converter crashes the whole process.
Would be nice if someone can test whether this is the case for server.cpp (with `stream=True` for example).
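For illustration of the failure mode, here is a Python sketch of the concept (not the server's actual C++ code path): a multi-byte UTF-8 character split across streamed chunks cannot be decoded chunk by chunk, which is the situation that can trip up a JSON serializer.

```python
# "☃" encodes to three UTF-8 bytes; splitting it across two chunks leaves the
# first chunk as an incomplete sequence that strict decoding rejects.
snowman = "☃".encode("utf-8")             # b'\xe2\x98\x83'
chunk1, chunk2 = snowman[:2], snowman[2:]

try:
    chunk1.decode("utf-8")
except UnicodeDecodeError as err:
    print("incomplete sequence:", err)

# A streaming consumer has to buffer bytes until the sequence is complete:
print((chunk1 + chunk2).decode("utf-8"))  # prints ☃
```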
@Azeirah I believe the hosted runner of GitHub is a Xeon with shared CPU cores. The performance is not meant to be consistent though. I believe it cannot use anything better than AVX2.
For training, I'm using a GTX 1660 Ti. I initially purchased it for gaming 2 years ago, but who knew that now I would need more VRAM than that :'( Back then, the dealer offered me a 3080 Ti at a fairly good price, but I refused. Nowadays, for anything bigger than 1B, I need to rent a VPS on Google Cloud; it's more or less the same price as Colab notebooks, but more flexible and with persistent storage.
> Since the number of test cases is not very big, can we reduce the number of files? Would be nice if we have a .sh script that runs `server -m ...` and `python -m behave` all at once.
Done :+1:
@ggerganov @ngxson Any idea on how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?
@phymbert Can you try this model instead? (pay attention to set `n_predict` or `max_tokens`, because the model never outputs the EOS token)
https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin
I have no idea if OpenBLAS will help. You can try if you want.
@Azeirah I tried to overfit an 86M model but unfortunately it does not seem to output any of the examples. But on the bright side, it outputs mostly text (not invalid bytes as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57MB.
Nice, thanks! Took 0m0.481s. Note I have also reduced the KV cache size.
> Any idea on how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@phymbert Best way to improve the speed is to use as small a model as possible. You can try @karpathy's tinyllamas: https://huggingface.co/karpathy/tinyllamas
Here are instructions for converting to GGUF and using in `llama.cpp`:
https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml
For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here:
https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf
Example:
```sh
# get the model
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

# run sample inference
./main -m ./stories260K.gguf -p "One day, Lily met" -n 128 -c 256
```

```
One day, Lily met a boy named Timmy. Tim was very happy to help her mommy. He wanted to play with the ball all day. Suddenly, something unexpected happened. A little girl came over and saw a big tree. She was very sad.
Timmy wanted to play with the ball. He thought it would be fun! When he reached up, he found it st

llama_print_timings: load time = 80.26 ms
llama_print_timings: sample time = 1.70 ms / 128 runs ( 0.01 ms per token, 75427.22 tokens per second)
llama_print_timings: prompt eval time = 3.06 ms / 7 tokens ( 0.44 ms per token, 2288.33 tokens per second)
llama_print_timings: eval time = 134.04 ms / 127 runs ( 1.06 ms per token, 947.46 tokens per second)
llama_print_timings: total time = 142.59 ms / 134 tokens
```
This should be ideal for CI
Btw, one thing that would greatly improve the state of `server` in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in `llama.log`, but there should be more.
Probably needs a separate PR to avoid this change becoming too big, but thought I would mention this in case you are interested in further helping out with maintenance
For a recent PR I did, it would be nice to confirm credentialed CORS keeps working correctly too, the idea is that if you add an authorization header to your request, CORS needs to behave in a specific way.
So the tests would be something like
```gherkin
Scenario Outline: Credentialed CORS requests
  Given a <request> with an api token
  Then 200 OK is given back
```
```python
@step(u'a {request} with an api token')
def step_request_with_api_token(context, request):
    # add the Authorization header to the context and make sure it is
    # sent with subsequent requests, e.g.:
    # Authorization: Bearer asdf98afbwqo8dsjlfh
    context.headers = {'Authorization': 'Bearer asdf98afbwqo8dsjlfh'}
```
^^ not working, just approximate pseudocode for what the test could look like
Best for all OpenAI endpoints, but just the completion endpoint alone is most important.
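A possible shape for that check, as a hedged Python sketch: the endpoint, origin, and server behaviour are assumptions here, but the CORS rule itself is standard (a credentialed response must echo a concrete origin rather than the `*` wildcard).

```python
# Illustrative only: verify credentialed CORS behaviour on a preflight request.
import requests

def check_credentialed_cors(base_url='http://localhost:8080',
                            origin='http://localhost:5173',
                            api_key='asdf98afbwqo8dsjlfh'):
    resp = requests.options(
        f'{base_url}/v1/chat/completions',
        headers={
            'Origin': origin,
            'Authorization': f'Bearer {api_key}',
            'Access-Control-Request-Method': 'POST',
        })
    assert resp.status_code == 200
    # with credentials, the wildcard "*" is not acceptable: the allowed
    # origin must match the request origin exactly
    assert resp.headers.get('Access-Control-Allow-Origin') == origin
```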
> For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here: https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf
This is outside the scope of this PR, but if we have such small models we should probably place them directly in the Git repository and have them run as a standard CI test. We could go through the full quantization and inference flow for each quant type and confirm that the output is correct using a fixed seed.
> Other than that, I think it's fine that the tests are in separate files. It's kinda just how behave is meant to be used, each feature is one file.
@ngxson @Azeirah I finally split the files per feature, as it is a Cucumber/Gherkin requirement in order to define server options in the `Background` steps. Also, `environment.py` is expected by behave for the before/after hooks.
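For reference, a minimal sketch of what such an `environment.py` could contain; the binary path, environment variable, model, and flags below are placeholders for illustration, not the PR's actual hooks.

```python
# Illustrative before/after hooks: start the server before each scenario
# and shut it down afterwards.
import os
import signal
import subprocess
import time

def before_scenario(context, scenario):
    server_bin = os.environ.get('LLAMA_SERVER_BIN_PATH', '../../../build/bin/server')
    context.server_process = subprocess.Popen(
        [server_bin, '--model', 'stories260K.gguf', '--ctx-size', '256', '--port', '8080'])
    time.sleep(1)  # naive readiness wait; the real tests would poll /health instead

def after_scenario(context, scenario):
    context.server_process.send_signal(signal.SIGINT)
    context.server_process.wait()
```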
@ggerganov I suggest going with this first version and seeing how it behaves on master.
Note: sorry, I messed up the GitHub CI workflow and triggered a lot of jobs on your repo's actions. I am cancelling them...
@ngxson @Azeirah Happy to have your review
I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.
Wow! Very nice work - this would be very useful and should help to improve `server` significantly.

> multi-user scenario where the total number of tokens to predict exceeds the KV cache size, fix to be confirmed: #3969
What was the fix?
No fix was applied actually; IMHO it's a wrong usage of the server when neither `--n_predict` nor `max_tokens` are set. If you provide `n_predict`/`max_tokens` in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.
I have also added a `wrong_usages.feature` file to trace and reproduce this kind of user issue.
@ggerganov Regarding #5655, I have reproduced it in `issues.feature`; to run it: `DEBUG=ON ./tests.sh --no-skipped --tags bug`
It can be investigated/fixed in another PR.
Thanks for the review. I will give concurrent streaming requests with aiohttp one last try, then merge this first version.
> I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.
@Azeirah No worries, take care, it can wait for tomorrow :+1:
> No fix was applied actually; IMHO it's a wrong usage of the server when neither `--n_predict` nor `max_tokens` are set. If you provide `n_predict`/`max_tokens` in the request, the server behaves well.
In the case the server is started with undesirable parameters, we should either abort or at the very least offer a clear warning with a suggested solution. Is that the case now?
I try to focus a lot on usability for end users.
> Btw, one thing that would greatly improve the state of `server` in terms of debugging issues is to add detailed logs.
On it, especially in `update_slots`, as it is a nightmare to understand what's going on.