WIP: basic correctness test
Introducing an end-to-end test case that verifies basic correctness of the vLLM engine by comparing the tokens returned by the vLLM OpenAI server with the tokens generated by the HuggingFace model created via AutoModelForCausalLM.from_pretrained().
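For context, a minimal sketch of the HuggingFace side of that comparison, assuming greedy generation so the output is deterministic (the helper name is illustrative; the real logic lives in tests/conftest.py):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hf_greedy_tokens(model_name: str, prompt: str, max_new_tokens: int = 32):
    """Illustrative baseline: greedy-generate with the plain HF model so the
    resulting token ids/text can be compared against the server's output."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated ids (everything after the prompt).
    new_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    return new_ids.tolist(), tokenizer.decode(new_ids)
```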
This Work In Progress PR is intended to verify the approach used to reconcile the different logprobs outputs: the server keys its logprobs by token string, while the HF model keys them by token id. It currently uses duplicate code in the tests.conftest.HfRunnerNM.generate_greedy_logprobs_nm_use_tokens() method, replacing each token id with the actual token string. If this approach is approved, I'll do some refactoring to reduce this and other code duplication across the fixtures in conftest.py.
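To make the keying difference concrete, here is a minimal sketch of the reconciliation idea, assuming the HF side yields one {token_id: logprob} dict per generated position (the exact structure lives in HfRunnerNM):

```python
def logprobs_keyed_by_token(hf_logprobs_per_position, tokenizer):
    """Re-key per-position logprob dicts from token id -> token string so
    they line up with the server's logprobs, which are keyed by text."""
    return [
        {tokenizer.decode([token_id]): logprob
         for token_id, logprob in position.items()}
        for position in hf_logprobs_per_position
    ]
```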
The use of the new VllmServer class is also an open question, as we are still determining the actual infrastructure that will be used to execute this test. [Credit to Domenic Barbuzzi for the VllmServer code.]
Thanks @derekk-nm, I think the new test is looking good!
```python
# now repeat using two gpus
# specifically doing it here, rather than as a pytest param,
# to avoid repeating the huggingface inference and data collection
```
If this test isn’t getting forked through pytest-forked or similar, I can work on a cached helper function to generate the reference outputs, so using 1 vs. 2 GPUs can be separate tests while still only incurring the cost of the HF output generation once.
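A rough sketch of what that cached helper could look like; it only pays off if both tests run in the same process (i.e. not forked), and the HfRunnerNM constructor and method arguments shown here are assumptions:

```python
import functools

@functools.lru_cache(maxsize=None)
def hf_reference_outputs(model: str, prompts: tuple, max_tokens: int):
    # Compute the HuggingFace reference outputs once per (model, prompts)
    # and reuse them across the single-GPU and multi-GPU tests.
    from tests.conftest import HfRunnerNM  # constructor/method signatures assumed
    runner = HfRunnerNM(model)
    return runner.generate_greedy_logprobs_nm_use_tokens(list(prompts), max_tokens)
```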
Another open question to resolve with this code is the handling of spaces in tokens.
Here's an image from my debugger showing that the HuggingFace-generated text does not have spaces in the tokens, but the OpenAI server's response does.
(I did try using clean_up_tokenization_spaces=False on the HuggingFace tokenizer's decode method when I retrieved the token's string, but that didn't help.)
Also, here is another image of an "empty" response from one of the other prompts that includes that marker; we'll have to deal with this too.
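For anyone reproducing this, a small way to see where the space information lives, assuming a SentencePiece-style tokenizer (the model name below is just an example): convert_ids_to_tokens shows the raw stored piece, while decode on a single id drops the leading space, which matches the mismatch described above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
token_id = tokenizer("hello world")["input_ids"][-1]
# Raw vocabulary piece, typically with a "▁" marker standing in for the space:
print(tokenizer.convert_ids_to_tokens(token_id))
# Decoded single token, with the leading space gone even with cleanup disabled:
print(tokenizer.decode(token_id, clean_up_tokenization_spaces=False))
```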
As far as I can tell, the tests are currently run with --forked, at least the existing tests. I don't know if we'll end up with some different architecture to run end-to-end tests.
In that case, maybe we can review the tests so that only the tests that actually require forking are forked (by using the pytest marker instead of the CLI flag), or otherwise split the suite into multiple runs so that this one doesn't have to be forked. Overall, I don’t think such an enhancement is required for the initial implementation.
@derekk-nm I added a couple of comments about the arg list. Per a previous discussion, I left in the initial implementation where the args dict passed to the VllmServer constructor expects all keys and values to be strings; it should actually be throwing errors in your tests because some of the values in the dicts are ints, floats, and None.
So, summarizing some comments:
- If the test and server helper are working as-is, something is going wrong with input validation
- Actually, going even further, even if the built-in error checking is not passing, the "command" list passed to something like Popen basically requires strings (technically also bytes or path-like objects), so no matter what, passing ints or None as values should not actually work…
- All that said, we can certainly amend VllmServer so that this args dict can have non-string values; that was a design question and simply requires some refactoring of the logic within the _args_to_list private method (no other external usage changes). A sketch of one possible approach follows this list.
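For illustration, one way the args dict could be flattened into the string-only command list that Popen expects (a standalone sketch; the real logic would live in VllmServer._args_to_list):

```python
def args_to_list(args: dict) -> list:
    """Skip None/False (option not set), emit True as a bare flag, and
    str() everything else so the command list contains only strings."""
    cli_args = []
    for key, value in args.items():
        if value is None or value is False:
            continue
        cli_args.append(str(key))
        if value is not True:  # boolean flags take no value
            cli_args.append(str(value))
    return cli_args

# e.g. args_to_list({"--tensor-parallel-size": 2, "--seed": None})
#      -> ["--tensor-parallel-size", "2"]
```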
Should this --tensor-parallel-size be treated the same way as the pytest.mark.parametrize arguments (e.g. model)? There is no need to repeat the test code here just for --tensor-parallel-size=2. It could also be set to a value other than 2 depending on the GPU type.
@dhuangnm I believe, per the comments, that this is currently part of the same test specifically so that the HuggingFace outputs only need to be generated once (as they are needed as a baseline comparison after both server portions).
@robertgshaw2-neuralmagic, @dbarbuzzi, @dhuangnm, I've pushed a bunch of changes to this WIP PR. Key things are specified in the commit message. The things that remain are:
- figure out how to ultimately do the merge to main: pile my changes on top of Domenic's server branch, or do these separately
- move files around per the PR comments
- identify which models will actually work with this test (many will fail due to missing optimum package)
- retrieve the token strings correctly
- uncomment the line that includes the tensor_parallel_size=2 parameter and test that it works on a k8s env that includes the necessary GPUs
I believe this latest commit correctly decodes the token ids from the HFRunnerNM, but I can't find a way to decode the token ids in the logprobs keys using _decode_token_by_position_index. The call to topk = tok_logprobs.topk() (line 416 in conftest.py), followed by topk.indices[0] (line 427), is returning the actual token ids. I can't find any useful explanation of how topk.indices is defined; the doc I found only states that topk returns the indices into the tensor, and I don't know which array the logprobs come from. When I attempted to use the keys in topk.indices as indices into seq_ids, the decoded values didn't make any sense (with either a straight decode or when passed to self._decode_token_by_position_index).
What I've done is decode the top logprob keys using self.tokenizer.decode. This leaves us with the missing leading space on the HuggingFace output, so my compare util strips whitespace for the comparison. This is not ideal, I know, but it seems to be working; most of the prompts pass with flying colors. Certainly open to suggestions!
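For reference, a sketch of that decode step and why topk.indices already holds token ids: assuming tok_logprobs is the per-step log-probability tensor over the vocabulary (shape [batch, vocab_size]), topk runs over the vocabulary dimension, so its indices are vocabulary positions, i.e. the token ids themselves.

```python
import torch
from transformers import PreTrainedTokenizerBase

def top_logprobs_by_token(tok_logprobs: torch.Tensor,
                          tokenizer: PreTrainedTokenizerBase,
                          k: int = 5) -> dict:
    """Decode the top-k candidates for one generation step, stripping
    whitespace to match the comparison workaround described above."""
    topk = tok_logprobs.topk(k, dim=-1)
    return {
        tokenizer.decode([token_id]).strip(): logprob
        for token_id, logprob in zip(topk.indices[0].tolist(),
                                     topk.values[0].tolist())
    }
```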
With that said, there are a number of prompts that result in responses that fail the "closeness" test we're using here. A manual review of the results (latest run output attached) shows that the responses are conceptually similar but grammatically different, so it's reasonable for them to fail this test. @robertgshaw2-neuralmagic, we'll need to discuss which model to use temporarily. In my run yesterday of all of the models in the current list of params, none passed.
This result prompted me to do additional exploration for a way to get the OpenAI server to return token ids instead of text. ChatGPT and Groq both seem to have hallucinated some interesting args that simply aren't real (at least not for the version of OpenAI that we're using). I still don't think it's currently possible to get the OpenAI server to generate output that matches the format of the HuggingFace runner.
Re-reading the other comments here, I know that I still have to move things around.
server_basic_correctness_mistral_results_202404301351.txt: this attachment is the latest output from the test execution. There are two tests using the same model: one using a single GPU, and another with 2 (via the tensor_parallel_size param). If you scroll to the lines that start with "E", you'll get to the error messages for the failing tests. As you can see, the majority of prompts compare successfully between the HF and vLLM server implementations, but the multi-GPU env is more "brittle" for this test.
It's not clear to me whether the issue is that the token strings in the logprobs dictionary were created by decoding the token id without any lookback (thus possibly missing a potential option in the logprobs that would have allowed the prompt to pass the test), or whether there's an actual bug in the server implementation causing these prompts to generate results that do not match the HuggingFace implementation.
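One illustrative way to decode a logprob candidate "with lookback" (purely a sketch of the concern above, not how the server actually builds its logprobs): decode the generated prefix with and without the candidate appended and take the suffix, which preserves leading spaces and multi-byte pieces that a bare single-id decode can lose.

```python
def decode_with_context(tokenizer, prior_ids: list, token_id: int) -> str:
    """Decode token_id as it would appear after prior_ids."""
    with_token = tokenizer.decode(prior_ids + [token_id])
    without_token = tokenizer.decode(prior_ids)
    return with_token[len(without_token):]
```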
Thank you @robertgshaw2-neuralmagic for the consultation. I've updated the script to use the existing comparison function and corrected the method that decodes the logprob keys. The two tests are now passing with the single model. @dbarbuzzi, let's coordinate on the location of the tests and the merge to main.