
DP and MP support

Open NathanHB opened this issue 1 year ago • 7 comments

adds support for both model and data parallelism

NathanHB avatar Jul 02 '24 15:07 NathanHB

Hi @NathanHB ! I'll aim to test this out very soon. Would you be willing to update the main README.md sections with the new intended usage for parallelize=True with / versus accelerate launch?

I'm torn with respect to multi-machine usage, we'd previously decided this was out-of-scope for HF models for us--is this something you all are using regularly for leaderboard evals though?

haileyschoelkopf avatar Jul 09 '24 13:07 haileyschoelkopf

Thanks @haileyschoelkopf ! Will update the readme.

> I'm torn with respect to multi-machine usage, we'd previously decided this was out-of-scope for HF models for us--is this something you all are using regularly for leaderboard evals though?

Yes, we use it for every model that does not fit on a single GPU. We need it to evaluate as fast as possible; some evals would take days without DP.

Why would it be out of scope for HF models?

NathanHB avatar Jul 10 '24 12:07 NathanHB

Note: this mostly allows running a single model with both data parallelism and pipeline parallelism across several GPUs of the same machine, so I'm unsure why you mention multi-machine usage. Both are currently available in the harness, but not at the same time for one model, which is what this PR enables.
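The combined mode can be pictured with a small sketch. `assign_gpus` below is a hypothetical helper, not the harness's actual code; it only illustrates how each of the `world_size` data-parallel processes could receive its own disjoint, interleaved slice of the machine's GPUs over which to pipeline the model (an assignment consistent with the per-process device maps visible in the run logs later in this thread):

```python
def assign_gpus(rank: int, world_size: int, num_gpus: int) -> list[int]:
    """Hypothetical sketch: give each data-parallel process an
    interleaved, disjoint slice of the machine's GPUs, so the model
    can be pipeline-parallelized across that slice.

    Not the harness's real logic -- just an illustration of how
    DP and MP can coexist on one machine.
    """
    return list(range(rank, num_gpus, world_size))

# With 4 GPUs and 2 processes, each process pipelines the model over
# 2 GPUs while the two processes split the eval data between them.
for rank in range(2):
    print(f"rank {rank}: GPUs {assign_gpus(rank, 2, 4)}")
# -> rank 0: GPUs [0, 2]
# -> rank 1: GPUs [1, 3]
```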

clefourrier avatar Jul 10 '24 13:07 clefourrier

Maybe this is user error on my part, but this seems not to be giving the desired result:

accelerate launch --num_processes 2 --multi_gpu lm_eval --model hf --model_args pretrained=gpt2,parallelize=True --tasks lambada_openai
2024-07-11:17:40:43,032 INFO     [__main__.py:272] Verbosity set to INFO
2024-07-11:17:40:43,042 INFO     [__main__.py:272] Verbosity set to INFO
2024-07-11:17:40:47,431 INFO     [__main__.py:369] Selected Tasks: ['lambada_openai']
2024-07-11:17:40:47,436 INFO     [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-07-11:17:40:47,436 INFO     [evaluator.py:189] Initializing hf model, with arguments: {'pretrained': 'gpt2', 'parallelize': True}
2024-07-11:17:40:47,461 INFO     [__main__.py:369] Selected Tasks: ['lambada_openai']
2024-07-11:17:40:47,464 INFO     [evaluator.py:152] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-07-11:17:40:47,465 INFO     [evaluator.py:189] Initializing hf model, with arguments: {'pretrained': 'gpt2', 'parallelize': True}
2024-07-11:17:40:48,299 INFO     [huggingface.py:372] Model parallel was set to True, setting max memory per GPU to {1: 50485133312, 3: 50485133312} and device map to 'auto'
2024-07-11:17:40:48,301 INFO     [huggingface.py:372] Model parallel was set to True, setting max memory per GPU to {0: 50762940416, 2: 50487230464} and device map to 'auto'
2024-07-11:17:40:49,395 WARNING  [huggingface.py:270] You are both using a HF Accelerate `device_map` and launching via `accelerate launch`. This will attempt to do model and data parallelism depending on the resources available.
2024-07-11:17:40:49,501 WARNING  [huggingface.py:270] You are both using a HF Accelerate `device_map` and launching via `accelerate launch`. This will attempt to do model and data parallelism depending on the resources available.
2024-07-11:17:40:50,623 WARNING  [task.py:325] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-07-11:17:40:50,623 WARNING  [task.py:325] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-07-11:17:40:50,674 INFO     [evaluator.py:261] Setting fewshot random generator seed to 1234
2024-07-11:17:40:50,675 INFO     [task.py:411] Building contexts for lambada_openai on rank 0...
 11%|██████                                                 | 573/5153 [00:00<00:05, 808.93it/s]2024-07-11:17:40:51,453 WARNING  [task.py:325] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-07-11:17:40:51,453 WARNING  [task.py:325] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-07-11:17:40:51,504 INFO     [evaluator.py:261] Setting fewshot random generator seed to 1234
2024-07-11:17:40:51,504 INFO     [task.py:411] Building contexts for lambada_openai on rank 0...
100%|██████████████████████████████████████████████████████| 5153/5153 [00:06<00:00, 825.19it/s]
2024-07-11:17:40:56,971 INFO     [evaluator.py:438] Running loglikelihood requests
100%|██████████████████████████████████████████████████████| 5153/5153 [00:06<00:00, 820.96it/s]
2024-07-11:17:40:57,831 INFO     [evaluator.py:438] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████| 5153/5153 [00:38<00:00, 133.56it/s]
Running loglikelihood requests:  99%|█████████████████████▉| 5127/5153 [00:39<00:00, 137.75it/s]bootstrapping for stddev: perplexity
Running loglikelihood requests: 100%|██████████████████████| 5153/5153 [00:39<00:00, 130.84it/s]
100%|█████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 88.52it/s]
bootstrapping for stddev: perplexity
100%|█████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 88.34it/s]
2024-07-11:17:41:43,189 INFO     [evaluation_tracker.py:240] Output path not provided, skipping saving results aggregated
hf (pretrained=gpt2,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.3256|±  |0.0065|
|              |       |none  |     0|perplexity|↓  |40.0554|±  |1.4787|

2024-07-11:17:41:45,522 INFO     [evaluation_tracker.py:240] Output path not provided, skipping saving results aggregated
hf (pretrained=gpt2,parallelize=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|    Tasks     |Version|Filter|n-shot|  Metric  |   | Value |   |Stderr|
|--------------|------:|------|-----:|----------|---|------:|---|-----:|
|lambada_openai|      1|none  |     0|acc       |↑  | 0.3256|±  |0.0065|
|              |       |none  |     0|perplexity|↓  |40.0554|±  |1.4787|

This is what I get when trying to run on 4 GPUs with --num_processes 2:

  • Results print twice. This seems to imply that both processes think they are rank 0 and the only process.
  • The conditional at L269 seems to let us fall through all the other conditions there, meaning rank and world size never get set to anything other than the defaults (rank=0, world_size=1).
  • Request counts aren't split in half, also indicating that lm.world_size and lm.rank aren't set correctly. We should see half the requests run on each process, and then only a single rank report back/save results and print tables.
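The expected behavior can be sketched as follows (`shard_requests` is a hypothetical helper, mirroring the every-`world_size`-th-request pattern commonly used for data parallelism): with `world_size=2`, each rank should build roughly half of the 5153 lambada_openai requests, instead of all of them as in the log above.

```python
import itertools

def shard_requests(requests, rank: int, world_size: int):
    """Sketch of the expected sharding: rank r handles every
    world_size-th request starting at index r. Hypothetical helper,
    not the harness's actual implementation."""
    return list(itertools.islice(requests, rank, None, world_size))

requests = list(range(5153))  # stand-in for the lambada_openai requests
shards = [shard_requests(requests, r, world_size=2) for r in range(2)]
print([len(s) for s in shards])  # -> [2577, 2576]: half each, not 5153 twice
```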

haileyschoelkopf avatar Jul 11 '24 17:07 haileyschoelkopf

@haileyschoelkopf I added a fix for the duplication of logs - can you tell me if it's better on your side?

clefourrier avatar Jul 15 '24 10:07 clefourrier

Hi @haileyschoelkopf! The linter check is not passing because of files I did not modify. I merged main and ran the linter again, which modified a few more files than necessary in the PR.

NathanHB avatar Jul 15 '24 11:07 NathanHB

Hi @NathanHB , #2104 should fix the linters!

haileyschoelkopf avatar Jul 15 '24 15:07 haileyschoelkopf

Thanks a lot for the fix, merged it to the PR!

clefourrier avatar Aug 05 '24 06:08 clefourrier