[CI/Build] Basic server correctness test
This PR introduces an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the corresponding HuggingFace model created with `AutoModelForCausalLM.from_pretrained()`.
Updates `HfRunner()` to accept a HuggingFace access token so that it can retrieve restricted-access (gated) models.
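For example, the token can be threaded through to `from_pretrained()` roughly like this (a minimal sketch; the `access_token` parameter name and the trimmed-down `HfRunner` are assumptions for illustration):

```python
# Hedged sketch: the real HfRunner holds more state; `access_token` is an
# assumed parameter name. `token=` is the transformers kwarg for gated repos
# (older transformers versions used `use_auth_token=` instead).
from typing import Optional

from transformers import AutoModelForCausalLM, AutoTokenizer


class HfRunner:
    def __init__(self, model_name: str, access_token: Optional[str] = None):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, token=access_token)
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name, token=access_token)
```

In the test, the token comes from the `HF_TOKEN` environment variable, e.g. `HfRunner(model_name, access_token=os.getenv("HF_TOKEN"))`.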
The new `HfRunnerNM.generate_greedy_logprobs_nm_use_tokens()` allows us to compare the HuggingFace-generated results (which report logprobs keyed by token id) with those from the vllm OpenAI server (which reports logprobs keyed by token text). This includes a new `_decode_token_by_position_index()` method that computes the token string correctly by using a lookback over the generated tokens list.
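The lookback idea is roughly the following (a simplified sketch; the method name comes from this PR, but the body and the `LOOKBACK` window size are assumptions):

```python
# Simplified sketch of lookback decoding; not the actual implementation.
from transformers import PreTrainedTokenizer

LOOKBACK = 4  # number of preceding tokens to include; assumed window size


def _decode_token_by_position_index(tokenizer: PreTrainedTokenizer,
                                    token_ids: list[int],
                                    index: int) -> str:
    """Return the text contributed by token_ids[index].

    Decoding a token id in isolation can yield '�' or drop a leading space,
    so decode a window ending at `index` and subtract the decode of the same
    window without the final token.
    """
    start = max(0, index - LOOKBACK)
    with_token = tokenizer.decode(token_ids[start:index + 1],
                                  skip_special_tokens=False)
    without_token = tokenizer.decode(token_ids[start:index],
                                     skip_special_tokens=False)
    return with_token[len(without_token):]
```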
Enhances the output of `check_logprobs_close()` to provide more detail about the failing tokens.
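The added detail (prompt index, token index, and both decoded completions) looks roughly like this; a hedged sketch only, since the real signature and output structure in the repo may differ:

```python
# Hedged sketch: output structure and message layout are illustrative only.
def check_logprobs_close(outputs_0, outputs_1, name_0: str, name_1: str):
    for prompt_idx, (out_0, out_1) in enumerate(zip(outputs_0, outputs_1)):
        tokens_0, text_0, _ = out_0
        _, text_1, top_logprobs_1 = out_1
        for token_idx, (token_0, candidates_1) in enumerate(
                zip(tokens_0, top_logprobs_1)):
            # Each token chosen by model 0 must appear among model 1's
            # top-logprob candidates for that position.
            assert token_0 in candidates_1, (
                f"{name_0} token {token_0!r} not in {list(candidates_1)}\n"
                f"prompt index {prompt_idx}, token index {token_idx}:\n"
                f"{name_0}: {text_0!r}\n"
                f"{name_1}: {text_1!r}")
```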
Adds the test to the appropriate `skip-*.txt` files so that this long-running test is not run automatically during dev push workflows.
To run this test manually (this assumes you've downloaded and installed the local nm-vllm package with `pip install -e .[sparse]` and all of the packages from requirements-common.txt, requirements-cuda.txt, and requirements-dev.txt):

- Define the `HF_TOKEN` environment variable with a valid HuggingFace access token
- `cd` to the `nm-vllm` directory
- Run the test with the command: `python3 -m pytest --forked tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server`

(Note: running this from my local env, I needed to include the `--import-mode importlib` option to work around a known issue in vllm.)
This test is failing today. Something's been broken over the weekend. The exception is:
==== server startup command args ====
--model mistralai/Mistral-7B-Instruct-v0.2 --max-model-len 4096 --disable-log-requests --tensor-parallel-size 2 --dtype half
====
(ServerRunner pid=1801782) Traceback (most recent call last):
(ServerRunner pid=1801782) File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(ServerRunner pid=1801782) return _run_code(code, main_globals, None,
(ServerRunner pid=1801782) File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
(ServerRunner pid=1801782) exec(code, run_globals)
(ServerRunner pid=1801782) File "/network/derekk/testdev1/nm-vllm/vllm/entrypoints/openai/api_server.py", line 23, in <module>
(ServerRunner pid=1801782) from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
(ServerRunner pid=1801782) File "/network/derekk/testdev1/nm-vllm/vllm/entrypoints/openai/serving_chat.py", line 15, in <module>
(ServerRunner pid=1801782) from vllm.model_executor.guided_decoding import (
(ServerRunner pid=1801782) File "/network/derekk/testdev1/nm-vllm/vllm/model_executor/guided_decoding/__init__.py", line 5, in <module>
(ServerRunner pid=1801782) from vllm.model_executor.guided_decoding.lm_format_enforcer_decoding import (
(ServerRunner pid=1801782) File "/network/derekk/testdev1/nm-vllm/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 5, in <module>
(ServerRunner pid=1801782) from lmformatenforcer import (CharacterLevelParser, JsonSchemaParser,
(ServerRunner pid=1801782) ModuleNotFoundError: No module named 'lmformatenforcer'
I don't understand why the build was skipped. I didn't try to skip it.
A couple of notes:
- The one test failure in the remote-push job is an intermittent marlin-related failure
- I ran these tests for the magic-wand and nm-vllm RCs for release testing and they both passed
After rebasing this branch onto main, the test is passing for me with the single Mistral model:
/root/pyvenv/nmv1/bin/python3 -m pytest --forked --import-mode importlib tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.2.1, pluggy-1.5.0
rootdir: /network/derekk/testdev1/nm-vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, forked-1.6.0, anyio-4.3.0, shard-0.1.2, asyncio-0.23.7
asyncio: mode=strict
collected 2 items
Running 2 items in this shard
tests/basic_correctness/test_basic_server_correctness.py .. [100%]
======================== 2 passed in 767.36s (0:12:47) =========================
Per Slack discussions, I've updated the test to include most of the remaining models (some must be skipped when a model requires a GPU device capability greater than that of the GPU under test). It was also necessary to ignore "special tokens" emitted by the HuggingFace runner for a few prompts in a number of models. Simply converting any special token to an empty string worked for all but one test; a rough sketch of both adjustments is shown below, followed by the run with the remaining failure.
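Hedged sketch of the two adjustments (the helper names and the per-model `min_capability` parameter are assumptions for illustration, not the actual test code):

```python
# Illustrative only: helper names and the per-model `min_capability`
# parameter are assumptions, not the actual test code.
import pytest
import torch


def maybe_skip_for_device(min_capability: tuple[int, int] | None) -> None:
    """Skip when the GPU under test is older than the model requires."""
    if min_capability is None:
        return
    capability = torch.cuda.get_device_capability()  # e.g. (8, 0) for A100
    if capability < min_capability:
        pytest.skip(f"model requires compute capability {min_capability}, "
                    f"GPU under test reports {capability}")


def normalize_hf_token(token_text: str, special_tokens: set[str]) -> str:
    """Treat any special token emitted by the HF runner as empty text."""
    return "" if token_text in special_tokens else token_text
```

With those adjustments in place, the run was: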
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.2.1, pluggy-1.5.0
rootdir: /network/derekk/testdev1/nm-vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, forked-1.6.0, anyio-4.3.0, shard-0.1.2, asyncio-0.23.7
asyncio: mode=strict
collected 20 items
Running 20 items in this shard
tests/basic_correctness/test_basic_server_correctness.py ......Fsss..... [ 75%]
.Fsss [100%]
=========================== short test summary info ============================
FAILED tests/basic_correctness/test_basic_server_correctness.py::test_models_on_server[None-3-32-microsoft/phi-2-2048-None-None]
FAILED tests/basic_correctness/test_basic_server_correctness.py::test_models_on_server[2-3-32-microsoft/phi-2-2048-None-None]
============= 2 failed, 12 passed, 6 skipped in 6544.64s (1:49:04) =============
The failure is the same for both executions with the same model:
E AssertionError: hf_model token '! Here’' not in [['’', '‘', '”']]
E prompt index 23, token index 4:
E hf_model: 'Absolutely! Here! Here’s an updated version of the essay that includes a few more anecdotes:\n\n<|im_start|>user\nWrite a'
E vllm_model: 'Absolutely! Here’s an updated version of the essay that includes a few more anecdotes:\n\nMy friendship with Sarah began in the tenth grade, during'
Without the special-token-to-empty-string workaround, the HuggingFace response in this case produced this error:
E AssertionError: hf_model token '�' not in [['', "'s", ' are']]
E prompt index 23, token index 3:
E hf_model: 'Absolutely! Here�! Here’s an updated version of the essay that includes a few more anecdotes:\n\n<|im_start|>user\nWrite a'
E vllm_model: 'Absolutely! Here’s an updated version of the essay that includes a few more anecdotes:\n\nI met Sarah in the tenth grade during a challenging time'
So it's not really related to the special token.
I've rebased this onto the latest nm-vllm/main. At this point the test includes a number of models, but skips a few that don't work with HuggingFace out of the box, plus the one that fails for a specific prompt. I've filed Asana tickets to address these later so that we can get this committed and running now.
@derekk-nm could you add a README in "neuralmagic" or "neuralmagic/tests" that outlines:
- the goal of these tests (this can be rather brief, but should be enough for other folks to understand)
- how to add or remove models