lm-evaluation-harness mlx Model (loglikelihood & generate_until)

This adds a new model type for mlx models. In particular, it implements the loglikelihood and generate_until interfaces. It works with the current versions of mlx (mlx-0.14.0.dev) and mlx-lm (mlx-lm-0.14.0) as of this writing.

The new model type is mlx, so the harness can be run this way to evaluate against a local mlx model:

lm_eval --model mlx --model_args model=.. model name or path ..   --tasks medqa_4options

The expected model args are:

model (Hugging Face model or local path to mlx model)
adapter_path (path to a LoRA adapter to apply to the model)
trust_remote_code
eos_token
top_p (defaults to 1)
max_tokens (defaults to 2048)
batch_size (defaults to 4)
max_gen_tokens (defaults to 256)
ensure_bos_token (defaults to False) : Whether or not to ensure the first token is a defined BOS token
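
As a rough illustration of how a comma-separated --model_args string could map onto these arguments, here is a minimal sketch. The function names (parse_model_args, coerce) are hypothetical and this is not the harness's own parser; only the defaults stated in the list above are filled in.

```python
# Defaults copied from the list above; arguments without a stated
# default (model, adapter_path, trust_remote_code, eos_token) are omitted.
DEFAULTS = {
    "top_p": 1,
    "max_tokens": 2048,
    "batch_size": 4,
    "max_gen_tokens": 256,
    "ensure_bos_token": False,
}


def coerce(value):
    """Turn 'True'/'False' into bools and numeric strings into numbers."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_model_args(arg_string, defaults=DEFAULTS):
    """Hypothetical helper: parse 'key=value,key=value' and apply defaults."""
    args = dict(defaults)
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        args[key.strip()] = coerce(value.strip())
    return args


parsed = parse_model_args("model=/path/to/model,batch_size=8,ensure_bos_token=True")
```

Here parsed keeps the stated defaults for top_p, max_tokens, and max_gen_tokens, while batch_size and ensure_bos_token take the values given on the command line.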

May 29 '24 21:05 chimezie

All committers have signed the CLA.

May 29 '24 21:05 CLAassistant

I'm getting the following traceback running the evaluation this way (in an environment with mlx and mlx-lm):

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks medqa_4options \
    --batch_size 64

Traceback:

2024-05-29:13:18:14,114 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-29:13:18:16,354 INFO     [__main__.py:341] Selected Tasks: ['medqa_4options']
2024-05-29:13:18:16,355 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-29:13:18:16,355 INFO     [evaluator.py:178] Initializing mlx model, with arguments: {'model': 'internistai/base-7b-v0.2'}
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 32968.33it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-05-29:13:18:20,863 INFO     [mlx_llms.py:28] Model type is '<class 'mlx_lm.models.llama.Model'>
2024-05-29:13:18:22,781 INFO     [task.py:398] Building contexts for medqa_4options on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1273/1273 [00:00<00:00, 198223.53it/s]
2024-05-29:13:18:22,818 INFO     [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (79 batches):  37%|███████████████████████████████████████▋                                                                    | 29/79 [10:13<15:22, 18.46s/it]
Running loglikelihood requests (79 batches): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [26:40<00:00, 20.26s/it]
[..snip..]
Traceback (most recent call last):
  File "/path/to/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/path/to/lm_eval/__main__.py", line 347, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/utils.py", line 321, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/evaluator.py", line 256, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/path/to/lm_eval/utils.py", line 321, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/lm_eval/evaluator.py", line 421, in evaluate
    task.apply_filters()
  File "/path/to/lm_eval/api/task.py", line 1000, in apply_filters
    f.apply(self._instances)
  File "/path/to/lm_eval/api/filter.py", line 55, in apply
    for inst, resp in zip(instances, resps):
  File "/path/to/lm_eval/filters/selection.py", line 23, in <lambda>
    return map(lambda r: r[0], resps)

The implemented loglikelihood function returns a list of 5,056 pairs of (log-likelihood, boolean). However, for some reason, the TakeFirstFilter.apply method receives a resps parameter with 5,092 entries, the last of which are empty lists, and this seems to be what causes the traceback.

Any help would be greatly appreciated.

May 29 '24 21:05 chimezie

However, I was able to run it against mmlu_professional_medicine:

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
    --tasks mmlu_professional_medicine \
    --batch_size 64
[..snip..]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.1838|±  |0.0235|

May 30 '24 00:05 chimezie

Oddly enough, I get a clean eval of internistai/base-7b-v0.2 on mmlu_professional_medicine with MLX, and the same result with HF, but I still hit the issue above when running against the medqa_4options task:

% time lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \                                              
    --tasks mmlu_professional_medicine \
    --batch_size 64 
2024-05-31:15:31:05,832 INFO     [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests (17 batches): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [04:55<00:00, 17.36s/it]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.7647|±  |0.0258|

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 --tasks   64  7.96s user 35.39s system 13% cpu 5:10.00 total

Hugging Face run on the same model:

% time lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float" --tasks mmlu_professional_medicine --device mps  --batch_size 64
hf (pretrained=internistai/base-7b-v0.2,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
|        Tasks        |Version|Filter|n-shot|Metric|Value |   |Stderr|
|---------------------|------:|------|-----:|------|-----:|---|-----:|
|professional_medicine|      0|none  |     0|acc   |0.7647|±  |0.0258|

lm_eval --model hf --model_args  --tasks mmlu_professional_medicine --device   28.83s user 117.90s system 63% cpu 3:49.41 total

Jun 01 '24 03:06 chimezie

I fixed some handling of batch remainders, and it looks good. I'm now running comparisons against HF/MPS/PyTorch for medqa and some related subsets of MMLU.
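
The counts reported earlier in the thread (79 batches at batch size 64, 5,056 results returned, 5,092 expected) are consistent with a final partial batch being dropped. The sketch below reproduces that arithmetic, assuming medqa_4options issues four requests per document for the 1,273 documents shown in the log. The function names are hypothetical and this is not the PR's batching code.

```python
def batches_dropping_remainder(items, batch_size):
    """Buggy variant: only full batches are produced."""
    n_full = len(items) // batch_size
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


def batches_keeping_remainder(items, batch_size):
    """Fixed variant: the last, shorter batch is kept."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Assumption: 1,273 documents x 4 answer options = 5,092 requests.
requests = list(range(1273 * 4))

dropped = batches_dropping_remainder(requests, 64)
kept = batches_keeping_remainder(requests, 64)

print(len(dropped), sum(len(b) for b in dropped))  # 79 5056
print(len(kept), sum(len(b) for b in kept))        # 80 5092
```

With the remainder kept, the last batch holds the 36 requests that would otherwise be left without a response.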

Jun 18 '24 01:06 chimezie

@haileyschoelkopf bringing this to your attention as well.

Jul 12 '24 07:07 lintangsutawika

Could add installation dependencies (like lm_eval[mlx]; see pyproject.toml) and a way to check whether the library is installed when the model is called (see lm_eval/models/anthropic_llms.py).

@lintangsutawika I have made these changes. Thanks for bringing it to my attention.
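
A check of this kind can be sketched with the standard library alone. The helper name require_optional is hypothetical, and this is not necessarily how the PR or anthropic_llms.py implements it; the lm_eval[mlx] extra is the one suggested in the comment above.

```python
import importlib.util


def require_optional(module_name, extra):
    """Raise a helpful error if an optional top-level module is missing.

    Hypothetical helper: `extra` is the name of the optional-dependency
    group to suggest in the error message.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this model type. "
            f"Install it with: pip install lm_eval[{extra}]"
        )


# A module that is always present passes silently.
require_optional("json", "mlx")
```

Calling the helper when the model class is constructed, rather than at import time, keeps the rest of the harness usable on machines without mlx installed.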

Jul 13 '24 16:07 chimezie

Mistakenly closed the PR

Jul 13 '24 16:07 chimezie

Hi! Thanks for the substantial PR, and sorry it took so long.

No worries

Left a couple of comments mainly about the indexing to extract the logprobs. A couple of other comments:

Thanks.

I think you can leave the tokenization to TemplateLM.loglikelihood (and move the loglikelihood logic to loglikelihood_tokens). This is mainly because we want to use encode_pair, which deals with a bug in some sentencepiece tokenizers.

Got it. Thanks. It wasn't always clear to me how to override this behavior in the least disruptive way, but this helps. I'll move this to loglikelihood_tokens.
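
For context, the general idea behind pair encoding can be sketched as follows: trailing whitespace is moved from the context to the continuation, the joined string is encoded once, and the continuation tokens are taken as whatever follows the context tokens. This is a simplified illustration with a toy tokenizer (toy_encode is hypothetical), not a copy of the harness code.

```python
import re


def toy_encode(text):
    """Toy stand-in for a tokenizer: each token is a word with its leading spaces."""
    return re.findall(r"\s*\S+|\s+", text)


def encode_pair(context, continuation, encode=toy_encode):
    """Encode context and continuation so their tokens concatenate cleanly."""
    # Move trailing whitespace from the context onto the continuation.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]

    # Encode the joined string once, then slice off the context part.
    whole_enc = encode(context + continuation)
    context_enc = encode(context)
    continuation_enc = whole_enc[len(context_enc):]
    return context_enc, continuation_enc


context_enc, continuation_enc = encode_pair("Answer: ", "yes")
```

Slicing the continuation out of the jointly encoded string ensures the two token lists add up to the encoding of the full text, which separate encoding does not always guarantee.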

Would also be great if you could add a test!

I will do that. Are there examples of other tests for lm_eval models that I can use to determine what conventions to follow?

Nov 22 '24 23:11 chimezie

Incorporated the suggested refactoring, moving the logic to _loglikelihood_tokens and deferring to the HF implementation of tokenizer_name and apply_chat_template, but I'm now getting:

2024-11-22:20:23:30,303 WARNING  [huggingface.py:1353] Failed to apply chat template. removing the system role in chat history.
Traceback (most recent call last):
  File "/path/to/lm_eval/models/huggingface.py", line 1349, in apply_chat_template
    chat_templated = self.tokenizer.apply_chat_template(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1867, in apply_chat_template
    rendered_chat = compiled_template.render(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.11/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/path/to/python3.11/site-packages/jinja2/environment.py", line 939, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 14, in top-level template code
  File "/path/to/python3.11/site-packages/jinja2/sandbox.py", line 394, in call
    return __context.call(__obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.11/site-packages/transformers/utils/chat_template_utils.py", line 410, in raise_exception
    raise jinja2.exceptions.TemplateError(message)
jinja2.exceptions.TemplateError: After the optional system message, conversation roles must alternate user/assistant/user/assistant/...

During handling of the above exception, another exception occurred:
[.. etc..]

Nov 23 '24 01:11 chimezie

@baberabb See my recent updates. I've made another attempt to mimic the HF model's loglikelihood_tokens implementation of one-token continuation caching, but I'm getting a KeyError in re_ord.get_cache(..):

% lm_eval --model mlx --model_args eos_token="[/INST]",model=/path/to/Mistral-Nemo-model,trust_remote_code=True --tasks mmlu_clinical_knowledge --batch_size 40
2024-11-28:15:27:29,784 INFO     [__main__.py:279] Verbosity set to INFO
2024-11-28:15:27:33,477 INFO     [__main__.py:376] Selected Tasks: ['mmlu_clinical_knowledge']
2024-11-28:15:27:33,479 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-11-28:15:27:33,479 INFO     [evaluator.py:201] Initializing mlx model, with arguments: {'eos_token': '[/INST]', 'model': '..', 'trust_remote_code': True}
2024-11-28:15:27:35,547 INFO     [mlx_llms.py:50] Model type is '<class 'mlx_lm.models.llama.Model'>
2024-11-28:15:27:35,993 INFO     [task.py:415] Building contexts for mmlu_clinical_knowledge on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 265/265 [00:00<00:00, 1927.58it/s]
2024-11-28:15:27:36,137 INFO     [evaluator.py:496] Running loglikelihood requests
Running mlx loglikelihood requests (1,060):   0%|                                                                                                                    | 0/1060 [00:00<?, ?it/s]
Traceback (most recent call last):
  [..snip..]

  File "/path/to/lm_eval/models/mlx_llms.py", line 249, in _loglikelihood_tokens
    for request_str, cont_toks, logits in re_ord.get_cache(
  File "/path/to/lm_eval/models/utils.py", line 484, in get_cache
    ] = self._arr_with_indices.pop(tuple(cxt_toks + cont_toks[:-1]))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: (..)
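
The traceback shows the cache being keyed on tuple(cxt_toks + cont_toks[:-1]). The toy sketch below illustrates that kind of grouping: requests whose context plus all-but-last continuation tokens are identical can share one forward pass. It also shows one way a KeyError could arise, namely when the key computed at lookup time differs from the key used at grouping time. That cause is an assumption for illustration, not a diagnosis of this PR, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_shared_prefix(requests):
    """Group (context, continuation) token pairs by context + continuation[:-1]."""
    groups = defaultdict(list)
    for index, (ctx, cont) in enumerate(requests):
        groups[tuple(ctx + cont[:-1])].append((index, cont))
    return dict(groups)


def pop_group(groups, ctx, cont):
    """Look up and remove a group; raises KeyError if the key is absent."""
    return groups.pop(tuple(ctx + cont[:-1]))


# Three one-token continuations share a context; a fourth has its own.
requests = [
    ([1, 2, 3], [10]),
    ([1, 2, 3], [11]),
    ([1, 2, 3], [12]),
    ([4, 5], [10]),
]
groups = group_by_shared_prefix(requests)
shared = pop_group(groups, [1, 2, 3], [10])  # three requests, one forward pass

try:
    # A context that was altered (here: truncated) after grouping no longer
    # matches any stored key.
    pop_group(groups, [2, 3], [10])
except KeyError as exc:
    missing_key = exc.args[0]
```

In this sketch the lookup must be made with exactly the token sequences that were used for grouping, so any truncation or re-tokenization between the two steps would surface as a KeyError.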

Nov 28 '24 21:11 chimezie

@baberabb I've removed all dependencies on the caching, and I'm able to get similar answer log-prob and greedy-match (greedy == continuation) values for a handful of questions I probed. However, the final top-level figures still don't match. I have run out of ideas as to why, and wonder whether the issue is at the level above _loglikelihood_tokens:

% lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56
[..snip..]
mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.2302|±  |0.0259|

% lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float32" --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56 --device mps
[..snip..]
hf (pretrained=internistai/base-7b-v0.2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.5132|±  |0.0308|
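
For reference, the two per-answer values being compared (the continuation log-probability and whether the continuation matches the greedy choice) can be sketched in plain Python. The sketch assumes logits[i] scores the token at position i + 1, so the continuation's scores start at index context_len - 1. The function names are hypothetical and this is not the PR's implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def score_continuation(logits, context_len, continuation):
    """Return (sum of continuation log-probs, continuation == greedy decode).

    logits: one row of vocabulary scores per input position.
    context_len: number of context tokens (must be at least 1).
    continuation: list of continuation token ids.
    """
    start = context_len - 1  # the row that predicts the first continuation token
    total = 0.0
    is_greedy = True
    for offset, token in enumerate(continuation):
        row = logits[start + offset]
        total += log_softmax(row)[token]
        greedy_token = max(range(len(row)), key=row.__getitem__)
        if greedy_token != token:
            is_greedy = False
    return total, is_greedy


# Vocabulary of 3 tokens, context of 2 tokens, 1 continuation token.
logits = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, math.log(2)],  # softmax gives probabilities 0.25, 0.25, 0.5
    [0.0, 0.0, 0.0],
]
logprob, greedy = score_continuation(logits, context_len=2, continuation=[2])
```

An off-by-one in the start index is an easy mistake here, because it shifts every score to the wrong position while still producing plausible-looking numbers.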

Dec 01 '24 00:12 chimezie

I have made many updates and now have figures that seem reasonably close to those of the HF model. I reviewed log prob scores (via --log_samples) for individual answers between the two, and they were comparable as well. Prefix prompt caching was also added, and generate_until support was removed (I can add a more robust implementation in a subsequent PR).

lm_eval --model mlx --model_args model=internistai/base-7b-v0.2 \
               --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56

mlx (model=internistai/base-7b-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.4566|±  |0.0307|

lm_eval --model hf --model_args pretrained=internistai/base-7b-v0.2,dtype="float32" \
              --tasks mmlusr_question_and_answer_clinical_knowledge --batch_size 56 --device mps

hf (pretrained=internistai/base-7b-v0.2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical knowledge|      1|none  |     0|acc   |↑  |0.5132|±  |0.0308|

% lm_eval --model mlx --model_args model=m42-health/Llama3-Med42-8B \
                   --tasks mmlu_clinical_knowledge

mlx (model=m42-health/Llama3-Med42-8B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical_knowledge|      1|none  |     0|acc   |↑  |0.7245|±  |0.0275|

% lm_eval --model hf --model_args pretrained=m42-health/Llama3-Med42-8B,dtype="float32" \
                  --tasks mmlu_clinical_knowledge --batch_size 56 --device mps

hf (pretrained=m42-health/Llama3-Med42-8B,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 56
|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|clinical_knowledge|      1|none  |     0|acc   |↑  |0.7547|±  |0.0265|
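
How the prefix prompt caching mentioned above is implemented is not shown in this thread. One common approach is to find the token prefix shared by a group of requests, run the model over it once, and reuse the resulting cache for each remaining suffix. The sketch below covers only the prefix-detection step, with hypothetical function names.

```python
def common_prefix_length(sequences):
    """Length of the longest token prefix shared by all sequences."""
    if not sequences:
        return 0
    shortest = min(len(s) for s in sequences)
    for i in range(shortest):
        token = sequences[0][i]
        if any(s[i] != token for s in sequences[1:]):
            return i
    return shortest


def split_shared_prefix(sequences):
    """Split sequences into (shared prefix, per-sequence suffixes)."""
    n = common_prefix_length(sequences)
    prefix = list(sequences[0][:n]) if sequences else []
    return prefix, [list(s[n:]) for s in sequences]


# Three answer options that share the same question tokens.
prefix, suffixes = split_shared_prefix([
    [1, 2, 3, 10],
    [1, 2, 3, 11],
    [1, 2, 3, 12, 13],
])
```

For multiple-choice tasks the shared prefix is typically the question itself, so only the short answer suffixes need a fresh forward pass.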

Dec 05 '24 16:12 chimezie
