Reported metrics are different with multi-node
Thanks for this repo!
I'm trying to set up evaluations across multiple nodes and multiple GPUs using `accelerate launch`. I've found that when running on multiple nodes, I get results that are slightly different from single-node results.
To debug, I've tried running the `EleutherAI/pythia-70m` model on the `lambada_openai` task.
One Node
With one node, I run with the following:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m,dtype="float" \
    --tasks lambada_openai \
    --num_fewshot 0 \
    --batch_size auto:4
```
As output, I get the following:
hf (pretrained=EleutherAI/pythia-70m,dtype=float), gen_kwargs: (), limit: None, num_fewshot: 0, batch_size: auto:4 (64,64,64,64,64)
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|-------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|130.9655|± |5.5013|
| | |none | 0|acc | 0.2271|± |0.0058|
Using `accelerate`, I can get the same result with two GPUs on the same machine.
Two Nodes
Launching the following on two nodes, two GPUs each, with a shared `$MASTER_PORT` and `$MASTER_ADDR`:
```bash
accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 4 \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank \$NODE_RANK \
    --rdzv_backend static \
    --max_restarts 0 \
    lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m,dtype="float" \
    --tasks lambada_openai \
    --num_fewshot 0 \
    --batch_size auto:4
```
Output is:
```text
[default0]:hf (pretrained=EleutherAI/pythia-70m,dtype=float), gen_kwargs: (), limit: None, num_fewshot: 0, batch_size: auto:4 (64,64,64,64,64)
[default0]:| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
[default0]:|--------------|-------|------|-----:|----------|-------:|---|-----:|
[default0]:|lambada_openai|Yaml |none | 0|perplexity|127.3498|± |5.4159|
[default0]:| | |none | 0|acc | 0.2266|± |0.0058|
```
Close, but slightly different from the results on a single node.
Any help would be appreciated!
python=3.9.18
accelerate=0.25.0
lm-evaluation-harness commit: e5dfd03 (0.4.0)
Hi! Thanks for reporting this. We currently don't support or test multi-node use via `accelerate`; we will document this better going forward.
If there is demand for it and another tool or library wouldn't be a better fit, we can consider testing and investigating it. My preference would be to support multi-node via, say, allowing for multiple self-hosted API inference servers with data-parallel vLLM, but if there is a compelling reason to support `accelerate` multi-node, it would be great to know your use case!
Understood, thanks for your reply! This certainly isn't a necessity for the library. I was just hoping to take advantage of more compute to speed up evaluations of larger benchmarks and models.
@haileyschoelkopf, regarding your proposal ("My preference would be to support multi-node via, say, allowing for multiple self-hosted API inference servers with data-parallel vLLM"), could you point to any implementations that already exist? If not, any general ideas on how to implement it? Thanks!
@leocnj I am not sure there are good open-source options already out there for scaling inference to multi-node setups, but the easiest way to support this would be to take one of our API-interface LM classes, allow it to accept a list of URLs/server addresses, and split the incoming requests evenly across those URLs to parallelize over multiple nodes.
So, for example, in `local-chat-completions` or the incoming `local-completions`, we could let the user spin up a local API server on each node they have access to, tell lm_eval all the available addresses, and have the client send 1/N of the requests to each (a rough sketch is below).
Does this make sense?
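For illustration only, here is a minimal, hypothetical sketch of that scatter step, assuming an OpenAI-compatible `/v1/completions` server (e.g. vLLM) is already running on each node. The helper names, URLs, and request payload below are assumptions made for the example, not the actual lm_eval classes or configuration:

```python
# Hypothetical sketch -- not the actual lm_eval API. It round-robins a batch of
# completion requests across several per-node server URLs and collects the
# responses in the original request order.
import itertools
from concurrent.futures import ThreadPoolExecutor

import requests


def complete(base_url: str, prompt: str, model: str = "EleutherAI/pythia-70m") -> dict:
    """Send a single completion request to one self-hosted server."""
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": model, "prompt": prompt, "max_tokens": 16},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


def scatter_requests(prompts: list[str], base_urls: list[str]) -> list[dict]:
    """Assign prompt i to server i % N, then gather results in the original order."""
    assignments = list(zip(prompts, itertools.cycle(base_urls)))
    with ThreadPoolExecutor(max_workers=len(base_urls)) as pool:
        futures = [pool.submit(complete, url, prompt) for prompt, url in assignments]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # One server per node; these addresses are placeholders.
    urls = ["http://node0:8000", "http://node1:8000"]
    results = scatter_requests(["Hello, my name is", "The capital of France is"], urls)
    print(f"received {len(results)} responses")
```

In lm_eval itself, this logic would presumably live inside the API-interface LM class rather than a standalone script, with the list of server addresses passed in through `model_args`.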