llama.cpp
server: init functional tests
Motivation
Tests were listed in #4216 as an improvement request.
The idea is to ensure all server routes are working properly, using the Gherkin language to define test cases following a BDD approach. It is designed to be human-readable and describes use cases relating to a software system.
Example
```gherkin
@llama.cpp
Scenario: Multi users
  Given a prompt:
    """
    Write a very long story about AI.
    """
  And a prompt:
    """
    Write another very long music lyrics.
    """
  And 32 max tokens to predict
  Given concurrent completion requests
  Then the server is busy
  And all slots are busy
  Then the server is idle
  And all slots are idle
  Then all prompts are predicted
```
Proposed changes
A CI workflow is triggered which builds and starts the server in the background, then test scenarios are launched with Python.
A very small model is used to quickly generate responses, and a fixed seed is set to ensure reproducibility.
The Gherkin glue is written in Python using behave.
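To make the wiring concrete, here is a minimal sketch of what behave glue for the scenario above could look like. The step names mirror the example, but the endpoint usage and assertions are illustrative assumptions, not the PR's actual code; the concurrent-request step is sketched separately after the status checklist further down.

```python
# Hypothetical behave glue for the scenario above (illustrative only).
from behave import step

@step('a prompt:')
def step_a_prompt(context):
    # context.text holds the multi-line docstring ("""...""") from the feature file
    if not hasattr(context, 'prompts'):
        context.prompts = []
    context.prompts.append(context.text)

@step('{n_predict:d} max tokens to predict')
def step_n_predict(context, n_predict):
    context.n_predict = n_predict

@step('all prompts are predicted')
def step_all_prompts_predicted(context):
    # context.responses is assumed to be filled by the concurrent-request step
    for response in context.responses:
        assert response.status_code == 200
        assert len(response.json()['content']) > 0
```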
Restriction
This is not designed to assess performance of the server.
Expected scenarios (from @ngxson's comment):
- [x] health and slots endpoints
- [x] completion endpoint
- [x] OAI-compatible chat completion requests with and without streaming
- [x] passing multi-user scenario
- [x] multi-user scenario on the OAI-compatible endpoint with streaming
- [x] multi-user scenario where the total number of tokens to predict exceeds the KV cache size, fix to be confirmed: #3969
- [x] slots shifting and continuous batching
- [x] embeddings endpoint
- [ ] embeddings endpoint with image
- [ ] multi user embedding endpoint: #5655
- [x] OpenAI-compatible embeddings API
- [x] tokenize endpoint
- [ ] infill endpoint
- [x] CORS and api key scenario @Azeirah
- [ ] Upon receiving incomplete unicode, the JSON converter crashes the whole process. @ngxson
An example of a passing GitHub workflow can be found here.
Reproduced issues
- #3969
- #3287
Status
Reverting to draft, as it will be more flexible to start the server with custom options for each scenario.
Issues/enhancements to be fixed before merging:
- [ ] Use a different CMake build config of the server to run the tests, for example with `-DLLAMA_SANITIZE_THREAD=ON`
- [x] start the server within the scenario background instead of starting it manually to pass custom parameters
- [x] make the server binary path configurable
- [x] fix the `/slots` and `/health` endpoints to properly access slots data over the `queue_tasks`: #5634
- [x] fix `slots[].state` in the health endpoint, which may be incoherent with the total idle slots under a race condition
- [x] change the ci build trigger
- [x] use asyncio and aiohttp to trigger concurrent HTTP requests (see the sketch after this list)
- [ ] fix the async OpenAI HTTP client with streaming
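As referenced in the checklist above, here is a hedged sketch of how asyncio and aiohttp could trigger concurrent completion requests against the server. The port, endpoint, and payload fields are assumptions for illustration, not the PR's actual test code.

```python
# Illustrative only: fire several /completion requests concurrently.
import asyncio
import aiohttp

BASE_URL = 'http://localhost:8080'  # assumed server address

async def post_completion(session, prompt, n_predict=32):
    async with session.post(f'{BASE_URL}/completion',
                            json={'prompt': prompt, 'n_predict': n_predict}) as resp:
        assert resp.status == 200
        return await resp.json()

async def run_concurrent(prompts):
    async with aiohttp.ClientSession() as session:
        # asyncio.gather() schedules all requests at once, so the server's
        # slots are exercised concurrently rather than one after another
        return await asyncio.gather(*(post_completion(session, p) for p in prompts))

if __name__ == '__main__':
    results = asyncio.run(run_concurrent([
        'Write a very long story about AI.',
        'Write another very long music lyrics.',
    ]))
    for r in results:
        print(r.get('content', ''))
```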
I'm quite out of my depth here, but if you can figure out a way to add server tests it would be awesome. I've sent you a collaborator invite.
@RafaAguilar @ngxson As you were part of the tests-related discussion, do you feel OK with the proposed approach here? If so, I will continue with asynchronous requests and the multi-user scenario.
Great idea, thanks for starting this PR. Some suggestions:
- Since the number of test cases is not very big, can we reduce the number of files? (so that future contributors can find things more easily)
- Would be nice if we have a .sh script that runs `server -m ...` and `python -m behave` all at once. It's again to be easier for future contributors. It can also be useful when we want to dockerize the test script in the future.
- ~~Note to myself: maybe I can fine-tune the bloom 560m model for use in this test. The smallest usable GGUF that we can find on hf is currently `tinyllama-2-1b-miniguanaco.Q2_K.gguf`~~ Tried finetuning `ahxt/LiteLlama-460M-1T` but the result is unusable.
Excellent! I'll be reading this PR today and see if I can add a test or help out in some way :)
Would it be possible to train a nonsensical 1M-param model? It should be really cheap and fast even on commodity hardware. These tests aren't meant to interact in any meaningful way with the output anyway.
I'm mentioning this because I see the trial run ran for 15 minutes for just two features with 3 scenarios each! Imagine the time needed to run 20-30 tests!
@Azeirah Yes it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on `n_predict` to stop the generation after X tokens.
Another problem is that small models tend to output invalid bytes instead of words (because part of the llama vocab is bytes, which allows it to do unicode). Maybe I need to limit the usable tokens in its vocab.
Anyway, I'll look into this this week; it's still a good exercise for me to train a model from zero.
Fair enough, I think it should be doable to make a model that behaves well enough. Potentially it could be trained explicitly to bias EOS whaha. I agree it would be a fun exercise, unfortunately I have a 7900xtx and I believe it cannot be used to train :(
In addition to that, we of course have no clue what kind of hardware these tests will be run on, but if it's a virtual core on a Xeon or some other, maybe we can try compiling OpenBLAS? I'm not sure if it'd even be worth investigating, depending on the speedup and the variety of weird hardware you could get on GitHub Actions. No clue what kind of control over the underlying (virtualised) hardware you'd get there.
Other than that, I think it's fine that the tests are in separate files. It's kinda just how behave is meant to be used, each feature is one file. Different related scenarios belong to one feature. I'm somewhat familiar with BDD myself since I use a loosely inspired variant at work, do you think BDD is unclear to some people? I could write a short readme explaining it.
Also, one case that I have never tested before is invalid unicode.
In my personal project (which uses llama.h), on receiving responses via `llama_token_to_piece`, I pass them to `nlohmann/json` to convert them to a JSON string. That's the same thing we're using in the server example. Upon receiving incomplete unicode, the JSON converter crashes the whole process.
Would be nice if someone can test whether this is the case for server.cpp (with `stream=True` for example).
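For illustration of the failure mode, here is a Python sketch of the concept (not the server's actual C++ code path): a multi-byte UTF-8 character split across streamed chunks cannot be decoded chunk by chunk, which is the situation that can trip up a JSON serializer.

```python
# "☃" encodes to three UTF-8 bytes; splitting it across two chunks leaves the
# first chunk as an incomplete sequence that strict decoding rejects.
snowman = "☃".encode("utf-8")             # b'\xe2\x98\x83'
chunk1, chunk2 = snowman[:2], snowman[2:]

try:
    chunk1.decode("utf-8")
except UnicodeDecodeError as err:
    print("incomplete sequence:", err)

# A streaming consumer has to buffer bytes until the sequence is complete:
print((chunk1 + chunk2).decode("utf-8"))  # prints ☃
```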
@Azeirah I believe the hosted runner of GitHub is a Xeon with shared CPU cores. The performance is not meant to be consistent though. I believe it cannot use anything better than AVX2.
For training, I'm using a GTX 1660 Ti. I initially purchased it for gaming 2 years ago, but who knew that now I would need more VRAM than that :'( Back then, the dealer offered me a 3080 Ti at a fairly good price, but I refused. Nowadays, for anything bigger than 1B, I need to rent a VPS on Google Cloud; it's more or less the same price as Colab notebooks, but more flexible and with persistent storage.
> Since the number of test cases is not very big, can we reduce the number of files? Would be nice if we have a .sh script that runs `server -m ...` and `python -m behave` all at once.
Done :+1:
@ggerganov @ngxson Any idea on how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?
@phymbert Can you try this model instead? (pay attention to set `n_predict` or `max_tokens`, because the model never outputs the EOS token)
https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin
I have no idea if OpenBLAS will help. You can try if you want.
@Azeirah I tried to overfit an 86M model but unfortunately it does not seem to output any of the examples. But on the bright side, it outputs mostly text (not invalid bytes as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57MB.
Nice, thanks! Took 0m0.481s. Note I have also reduced the KV cache size.
> Any idea on how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?

@phymbert Best way to improve the speed is to use as small a model as possible. You can try @karpathy's tinyllamas: https://huggingface.co/karpathy/tinyllamas
Here are instructions for converting to GGUF and using in `llama.cpp`:
https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml
For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here:
https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf
Example:
```sh
# get the model
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf

# run sample inference
./main -m ./stories260K.gguf -p "One day, Lily met" -n 128 -c 256
```

```
One day, Lily met a boy named Timmy. Tim was very happy to help her mommy. He wanted to play with the ball all day. Suddenly, something unexpected happened. A little girl came over and saw a big tree. She was very sad.
Timmy wanted to play with the ball. He thought it would be fun! When he reached up, he found it st

llama_print_timings: load time = 80.26 ms
llama_print_timings: sample time = 1.70 ms / 128 runs ( 0.01 ms per token, 75427.22 tokens per second)
llama_print_timings: prompt eval time = 3.06 ms / 7 tokens ( 0.44 ms per token, 2288.33 tokens per second)
llama_print_timings: eval time = 134.04 ms / 127 runs ( 1.06 ms per token, 947.46 tokens per second)
llama_print_timings: total time = 142.59 ms / 134 tokens
```
This should be ideal for CI
Btw, one thing that would greatly improve the state of `server` in terms of debugging issues is to add detailed logs. Things like incoming requests, parameters, batch info, etc. As much information as possible should be dumped in the log file. There is some info currently saved in `llama.log`, but there should be more.
Probably needs a separate PR to avoid this change becoming too big, but thought I would mention this in case you are interested in further helping out with maintenance
For a recent PR I did, it would be nice to confirm credentialed CORS keeps working correctly too, the idea is that if you add an authorization header to your request, CORS needs to behave in a specific way.
So the tests would be something like
```gherkin
Scenario Outline: Credentialed CORS requests
  Given a <request> with an api token
  Then 200 OK is given back
```
```python
@step(u'a {request} with an api token')
def step_request_with_api_token(context, request):
    # add the Authorization header to the context and make sure it is
    # sent with subsequent requests, e.g.:
    # Authorization: Bearer asdf98afbwqo8dsjlfh
    context.headers = {'Authorization': 'Bearer asdf98afbwqo8dsjlfh'}
```
^^ not working, just approximate pseudocode for what the test could look like
Best for all OpenAI endpoints, but just the completion endpoint alone is most important.
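A possible shape for that check, as a hedged Python sketch: the endpoint, origin, and server behaviour are assumptions here, but the CORS rule itself is standard (a credentialed response must echo a concrete origin rather than the `*` wildcard).

```python
# Illustrative only: verify credentialed CORS behaviour on a preflight request.
import requests

def check_credentialed_cors(base_url='http://localhost:8080',
                            origin='http://localhost:5173',
                            api_key='asdf98afbwqo8dsjlfh'):
    resp = requests.options(
        f'{base_url}/v1/chat/completions',
        headers={
            'Origin': origin,
            'Authorization': f'Bearer {api_key}',
            'Access-Control-Request-Method': 'POST',
        })
    assert resp.status_code == 200
    # with credentials, the wildcard "*" is not acceptable: the allowed
    # origin must match the request origin exactly
    assert resp.headers.get('Access-Control-Allow-Origin') == origin
```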
> For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here: https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf
This is outside the scope of this PR, but if we have such small models we should probably place them directly in the Git repository and have them run as a standard CI test. We could go through the full quantization and inference flow for each quant type and confirm that the output is correct using a fixed seed.
> Other than that, I think it's fine that the tests are in separate files. It's kinda just how behave is meant to be used, each feature is one file.
@ngxson @Azeirah I finally split the files per feature, as it is a Cucumber/Gherkin requirement in order to define server options in the `Background` steps. Also, `environment.py` is expected by behave for the before/after hooks.
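For reference, a minimal sketch of what such an `environment.py` could contain; the binary path, environment variable, model, and flags below are placeholders for illustration, not the PR's actual hooks.

```python
# Illustrative before/after hooks: start the server before each scenario
# and shut it down afterwards.
import os
import signal
import subprocess
import time

def before_scenario(context, scenario):
    server_bin = os.environ.get('LLAMA_SERVER_BIN_PATH', '../../../build/bin/server')
    context.server_process = subprocess.Popen(
        [server_bin, '--model', 'stories260K.gguf', '--ctx-size', '256', '--port', '8080'])
    time.sleep(1)  # naive readiness wait; the real tests would poll /health instead

def after_scenario(context, scenario):
    context.server_process.send_signal(signal.SIGINT)
    context.server_process.wait()
```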
@ggerganov I suggest going with this first version and seeing how it behaves on master.
Note: sorry, I messed up the GitHub CI workflow and triggered a lot of jobs on your repo's actions. I am cancelling them...
@ngxson @Azeirah Happy to have your review
I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.
Wow! Very nice work - this would be very useful and should help to improve `server` significantly.

> multi-user scenario where the total number of tokens to predict exceeds the KV cache size, fix to be confirmed: #3969
What was the fix?
No fix was applied actually; IMHO it's a wrong usage of the server when neither `--n_predict` nor `max_tokens` are set. If you provide `n_predict`/`max_tokens` in the request, the server behaves well. I have updated the PR description as it was confusing. But IMHO, a server should never loop infinitely.
I have also added a `wrong_usages.feature` file to trace and reproduce this kind of user issue.
@ggerganov Regarding #5655, I have reproduced it in `issues.feature`; to run it: `DEBUG=ON ./tests.sh --no-skipped --tags bug`
It can be investigated/fixed in another PR.
Thanks for the review. I will give concurrent streaming requests with aiohttp one last try, then merge this first version.
> I will review this fully tomorrow, I'm a bit sick but I have energy when I plan it out.
@Azeirah No worries, take care, it can wait for tomorrow :+1:
> No fix was applied actually; IMHO it's a wrong usage of the server when neither `--n_predict` nor `max_tokens` are set. If you provide `n_predict`/`max_tokens` in the request, the server behaves well.
In the case the server is started with undesirable parameters, we should either abort or at the very least offer a clear warning with a suggested solution. Is that the case now?
I try to focus a lot on usability for end users.
> Btw, one thing that would greatly improve the state of `server` in terms of debugging issues is to add detailed logs.
On it, especially in `update_slots`, as it is a nightmare to understand what's going on.