Reported metrics are different with multi-node
Thanks for this repo!
I'm trying to set up evaluations across multiple nodes and multiple GPUs using `accelerate launch`. I've found that when running on multiple nodes, I get results that are slightly different from single-node results.
To debug, I've tried running the `EleutherAI/pythia-70m` model on the `lambada_openai` task.
One Node
With one node, I run with the following:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m,dtype="float" \
    --tasks lambada_openai \
    --num_fewshot 0 \
    --batch_size auto:4
```
As output, I get the following:
hf (pretrained=EleutherAI/pythia-70m,dtype=float), gen_kwargs: (), limit: None, num_fewshot: 0, batch_size: auto:4 (64,64,64,64,64)
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|-------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|130.9655|± |5.5013|
| | |none | 0|acc | 0.2271|± |0.0058|
Using `accelerate`, I can get the same result with two GPUs on the same machine.
Two Nodes
Launching the following on two nodes, two GPUs each, with a shared `$MASTER_PORT` and `$MASTER_ADDR`:
```bash
accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 4 \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank \$NODE_RANK \
    --rdzv_backend static \
    --max_restarts 0 \
    lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m,dtype="float" \
    --tasks lambada_openai \
    --num_fewshot 0 \
    --batch_size auto:4
```
Output is:
```text
[default0]:hf (pretrained=EleutherAI/pythia-70m,dtype=float), gen_kwargs: (), limit: None, num_fewshot: 0, batch_size: auto:4 (64,64,64,64,64)
[default0]:| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
[default0]:|--------------|-------|------|-----:|----------|-------:|---|-----:|
[default0]:|lambada_openai|Yaml |none | 0|perplexity|127.3498|± |5.4159|
[default0]:| | |none | 0|acc | 0.2266|± |0.0058|
```
Close, but slightly different from the results on a single node.
Any help would be appreciated!
python=3.9.18
accelerate=0.25.0
lm-evaluation-harness commit: e5dfd03 (0.4.0)
Hi! Thanks for reporting this. We currently don't support or test multi-node use via `accelerate`; we will document this better going forward.
If there is demand for it and another tool or library wouldn't be a better fit, we can consider testing and investigating it. My preference would be to support multi-node via, say, allowing for multiple self-hosted API inference servers with data-parallel vLLM, but if there is a compelling reason to support `accelerate` multi-node, it would be great to know your use case!
Understood, thanks for your reply! This certainly isn't a necessity for the library. I was just hoping to take advantage of more compute to speed up evaluations of larger benchmarks and models.
@haileyschoelkopf, regarding your proposal ("My preference would be to support multi-node via, say, allowing for multiple self-hosted API inference servers with data-parallel vLLM"), could you point to any implementations that already exist? If not, any general ideas on how to implement it? Thanks!
@leocnj I am not sure there are good open-source options already out there for scaling inference to multi-node setups, but the easiest way to support this would be to take one of our API-interface LM classes, allow it to accept a list of URLs/server addresses, and split the incoming requests evenly across those URLs to parallelize over multiple nodes.
So, for example, in `local-chat-completions` or the incoming `local-completions`, we could let the user spin up a local API server on each node they have access to, tell lm_eval all the available addresses, and have the client send 1/N of the requests to each (a rough sketch is below).
Does this make sense?
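For illustration only, here is a minimal, hypothetical sketch of that scatter step, assuming an OpenAI-compatible `/v1/completions` server (e.g. vLLM) is already running on each node. The helper names, URLs, and request payload below are assumptions made for the example, not the actual lm_eval classes or configuration:

```python
# Hypothetical sketch -- not the actual lm_eval API. It round-robins a batch of
# completion requests across several per-node server URLs and collects the
# responses in the original request order.
import itertools
from concurrent.futures import ThreadPoolExecutor

import requests


def complete(base_url: str, prompt: str, model: str = "EleutherAI/pythia-70m") -> dict:
    """Send a single completion request to one self-hosted server."""
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": model, "prompt": prompt, "max_tokens": 16},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


def scatter_requests(prompts: list[str], base_urls: list[str]) -> list[dict]:
    """Assign prompt i to server i % N, then gather results in the original order."""
    assignments = list(zip(prompts, itertools.cycle(base_urls)))
    with ThreadPoolExecutor(max_workers=len(base_urls)) as pool:
        futures = [pool.submit(complete, url, prompt) for prompt, url in assignments]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # One server per node; these addresses are placeholders.
    urls = ["http://node0:8000", "http://node1:8000"]
    results = scatter_requests(["Hello, my name is", "The capital of France is"], urls)
    print(f"received {len(results)} responses")
```

In lm_eval itself, this logic would presumably live inside the API-interface LM class rather than a standalone script, with the list of server addresses passed in through `model_args`.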