
BigScience Eval Harness

[Open] Muennighoff opened this pull request 2 years ago • 0 comments

Changes:

  • Adds compatibility with BigScience's fork of the lm-evaluation-harness here.
  • Reduces the memory load by offloading to CPU earlier, thanks to @thomasw21!
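
The CPU-offload change can be sketched roughly as follows (a minimal illustration of the idea, not the PR's actual code; `gather_logits_on_cpu` is a hypothetical name): instead of concatenating all per-micro-batch logits on the GPU, each chunk is moved to host memory first, so the large concatenated tensor never has to fit in GPU memory.

```python
import torch

def gather_logits_on_cpu(chunks):
    """Concatenate per-micro-batch logits on the CPU.

    Moving each chunk off the accelerator *before* torch.cat means the
    full concatenated tensor is allocated in host RAM rather than GPU
    memory -- the allocation that previously triggered the OOM.
    """
    return torch.cat([c.detach().cpu() for c in chunks], dim=0)
```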

Notes:

  • Almost the same as the existing evaluate functionality, but with some changes in the .py script, as the BigScience fork has diverged from the original evaluation harness. :)
  • Sorry for the long commit history - I will squash when merging.

Re: memory: with micro_bs_multiplier=16, the run now proceeds without OOM:

[default6]:Running loglikelihood requests
[default7]:Running loglikelihood requests
[default0]:
[default0]:  0%|          | 0/6540 [00:00<?, ?it/s][default0]:
[default0]:  0%|          | 16/6540 [00:04<31:39,  3.44it/s][default0]:
[default0]:  0%|          | 32/6540 [00:20<1:15:48,  1.43it/s][default0]:
[default0]:  1%|          | 48/6540 [00:29<1:09:21,  1.56it/s][default0]:
[default0]:  1%|          | 64/6540 [00:36<1:01:34,  1.75it/s][default0]:
[default0]:  1%|          | 80/6540 [00:43<55:39,  1.93it/s]  [default0]:
[default0]:  1%|▏         | 96/6540 [00:50<51:09,  2.10it/s][default0]:
[default0]:  2%|▏         | 112/6540 [00:56<47:33,  2.25it/s][default0]:
[default0]:  2%|▏         | 128/6540 [01:01<44:25,  2.41it/s][default0]:
[default0]:  2%|▏         | 144/6540 [01:07<42:08,  2.53it/s][default0]:
[default0]:  2%|▏         | 160/6540 [01:12<40:23,  2.63it/s][default0]:
[default0]:  3%|▎         | 176/6540 [01:18<39:02,  2.72it/s][default0]:
[default0]:  3%|▎         | 192/6540 [01:23<37:54,  2.79it/s][default0]:
[default0]:  3%|▎         | 208/6540 [01:29<36:52,  2.86it/s][default0]:
[default0]:  3%|▎         | 224/6540 [01:34<36:00,  2.92it/s][default0]:
[default0]:  4%|▍         | 240/6540 [01:39<35:03,  3.00it/s][default0]:
[default0]:  4%|▍         | 256/6540 [01:44<34:21,  3.05it/s][default0]:
[default0]:  4%|▍         | 272/6540 [01:49<33:50,  3.09it/s][default0]:
[default0]:  4%|▍         | 288/6540 [01:54<33:21,  3.12it/s][default0]:
[default0]:  5%|▍         | 304/6540 [01:59<32:50,  3.17it/s]
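
The progress bar above advances 16 requests per step, consistent with the loglikelihood requests being processed in groups scaled by micro_bs_multiplier. A rough sketch of that chunking (illustrative only; `chunked` and its signature are hypothetical names, and the real evaluate.py logic differs in detail):

```python
def chunked(requests, micro_batch_size, multiplier=16):
    """Yield loglikelihood requests in groups of
    micro_batch_size * multiplier, so each forward pass handles
    one group -- larger multipliers mean fewer, bigger batches
    and a correspondingly larger peak logits allocation."""
    step = micro_batch_size * multiplier
    for i in range(0, len(requests), step):
        yield requests[i:i + step]
```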

Previously:

[default7]:}
[default7]:warning: provide_description is deprecated and will be removed in a future version in favor of description_dict
[default5]:Running loglikelihood requests
[default6]:Running loglikelihood requests
[default7]:Running loglikelihood requests
[default0]:
[default0]:  0%|          | 0/6540 [00:00<?, ?it/s][default0]:
[default0]:  0%|          | 16/6540 [00:04<31:16,  3.48it/s][default7]:Traceback (most recent call last):
[default7]:  File "./tasks/eval_harness/evaluate.py", line 446, in <module>
[default7]:    main()
[default7]:  File "./tasks/eval_harness/evaluate.py", line 429, in main
[default7]:    results = evaluator.evaluate(adaptor, {task_name: task}, False, 0, None, bootstrap_iters=args.bootstrap_iters)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/eval/lm-evaluation-harness-thomas/lm_eval/utils.py", line 164, in _wrapper
[default7]:    return fn(*args, **kwargs)
[default7]:  File "/gpfsssd/worksf/projects/rech/six/commun/code/eval/lm-evaluation-harness-thomas/lm_eval/evaluator.py", line 247, in evaluate
[default7]:    resps = getattr(lm, reqtype)([req.args for req in reqs])
[default7]:  File "./tasks/eval_harness/evaluate.py", line 91, in loglikelihood
[default7]:    return self._loglikelihood_tokens(new_reqs)
[default7]:  File "./tasks/eval_harness/evaluate.py", line 157, in _loglikelihood_tokens
[default7]:    logits = self._model_call(torch.cat(inps, dim=0))
[default7]:  File "./tasks/eval_harness/evaluate.py", line 220, in _model_call
[default7]:    output = torch.cat(output, 0)[:len(inps)]
[default7]:RuntimeError: CUDA out of memory. Tried to allocate 15.49 GiB (GPU 7; 79.17 GiB total capacity; 60.01 GiB already allocated; 13.71 GiB free; 61.92 GiB reserved in total by PyTorch) If reserved$
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857715 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857716 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857717 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857718 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857719 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857720 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 857721 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 7 (pid: 857722) of binary: /gpfswork/rech/six/commun/conda/py38-pt111/bin/python

Muennighoff · Jun 29 '22 11:06