
eval.py error while benchmarking T5

Open · sigjhl opened this issue 2 years ago · 1 comment

Console

```
[Eval batch=1/1289] Eval on lambada_openai/0-shot data
[Eval batch=130/1289] Eval on lambada_openai/0-shot data
[Eval batch=259/1289] Eval on lambada_openai/0-shot data
[Eval batch=387/1289] Eval on lambada_openai/0-shot data
[Eval batch=516/1289] Eval on lambada_openai/0-shot data
[Eval batch=645/1289] Eval on lambada_openai/0-shot data
[Eval batch=774/1289] Eval on lambada_openai/0-shot data
[Eval batch=903/1289] Eval on lambada_openai/0-shot data
[Eval batch=1031/1289] Eval on lambada_openai/0-shot data
[Eval batch=1160/1289] Eval on lambada_openai/0-shot data
/home/codeless/Desktop/llm-foundry/mosaic/lib/python3.10/site-packages/composer/core/data_spec.py:35: UserWarning: Cannot split tensor of length 1 into batches of size 4. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
  warnings.warn(f'Cannot split tensor of length {len(t)} into batches of size {microbatch_size}. '
/home/codeless/Desktop/llm-foundry/mosaic/lib/python3.10/site-packages/composer/core/data_spec.py:26: UserWarning: Cannot split list of length 1 into batches of size 4. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
  warnings.warn(f'Cannot split list of length {len(l)} into batches of size {microbatch_size}. '
[Eval batch=1289/1289] Eval on lambada_openai/0-shot data
[Eval batch=1/919] Eval on piqa/10-shot data
[Eval batch=93/919] Eval on piqa/10-shot data
[Eval batch=185/919] Eval on piqa/10-shot data
[Eval batch=276/919] Eval on piqa/10-shot data
[Eval batch=368/919] Eval on piqa/10-shot data
[Eval batch=460/919] Eval on piqa/10-shot data
[Eval batch=552/919] Eval on piqa/10-shot data
[Eval batch=644/919] Eval on piqa/10-shot data
[Eval batch=735/919] Eval on piqa/10-shot data
[Eval batch=827/919] Eval on piqa/10-shot data
[Eval batch=919/919] Eval on piqa/10-shot data
[Eval batch=1/10042] Eval on hellaswag/10-shot data
[Eval batch=1005/10042] Eval on hellaswag/10-shot data
[Eval batch=2009/10042] Eval on hellaswag/10-shot data
[Eval batch=3013/10042] Eval on hellaswag/10-shot data
[Eval batch=4017/10042] Eval on hellaswag/10-shot data
[Eval batch=5022/10042] Eval on hellaswag/10-shot data
[Eval batch=6026/10042] Eval on hellaswag/10-shot data
[Eval batch=7030/10042] Eval on hellaswag/10-shot data
[Eval batch=8034/10042] Eval on hellaswag/10-shot data
[Eval batch=9038/10042] Eval on hellaswag/10-shot data
[Eval batch=10042/10042] Eval on hellaswag/10-shot data
[Eval batch=1/2376] Eval on arc_easy/10-shot data
[Eval batch=238/2376] Eval on arc_easy/10-shot data
[Eval batch=476/2376] Eval on arc_easy/10-shot data
[Eval batch=714/2376] Eval on arc_easy/10-shot data
[Eval batch=951/2376] Eval on arc_easy/10-shot data
[Eval batch=1188/2376] Eval on arc_easy/10-shot data
[Eval batch=1426/2376] Eval on arc_easy/10-shot data
[Eval batch=1664/2376] Eval on arc_easy/10-shot data
[Eval batch=1901/2376] Eval on arc_easy/10-shot data
[Eval batch=2138/2376] Eval on arc_easy/10-shot data
[Eval batch=2376/2376] Eval on arc_easy/10-shot data
[Eval batch=1/1172] Eval on arc_challenge/10-shot data
[Eval batch=118/1172] Eval on arc_challenge/10-shot data
[Eval batch=235/1172] Eval on arc_challenge/10-shot data
[Eval batch=352/1172] Eval on arc_challenge/10-shot data
[Eval batch=469/1172] Eval on arc_challenge/10-shot data
[Eval batch=586/1172] Eval on arc_challenge/10-shot data
[Eval batch=704/1172] Eval on arc_challenge/10-shot data
[Eval batch=821/1172] Eval on arc_challenge/10-shot data
[Eval batch=938/1172] Eval on arc_challenge/10-shot data
[Eval batch=1055/1172] Eval on arc_challenge/10-shot data
[Eval batch=1172/1172] Eval on arc_challenge/10-shot data
[Eval batch=1/50] Eval on copa/0-shot data
[Eval batch=6/50] Eval on copa/0-shot data
[Eval batch=11/50] Eval on copa/0-shot data
[Eval batch=16/50] Eval on copa/0-shot data
[Eval batch=21/50] Eval on copa/0-shot data
[Eval batch=26/50] Eval on copa/0-shot data
[Eval batch=30/50] Eval on copa/0-shot data
[Eval batch=35/50] Eval on copa/0-shot data
[Eval batch=40/50] Eval on copa/0-shot data
[Eval batch=45/50] Eval on copa/0-shot data
[Eval batch=50/50] Eval on copa/0-shot data
[Eval batch=1/1635] Eval on boolq/10-shot data
[Eval batch=164/1635] Eval on boolq/10-shot data
[Eval batch=328/1635] Eval on boolq/10-shot data
[Eval batch=491/1635] Eval on boolq/10-shot data
[Eval batch=655/1635] Eval on boolq/10-shot data
[Eval batch=818/1635] Eval on boolq/10-shot data
[Eval batch=981/1635] Eval on boolq/10-shot data
[Eval batch=1145/1635] Eval on boolq/10-shot data
[Eval batch=1308/1635] Eval on boolq/10-shot data
[Eval batch=1472/1635] Eval on boolq/10-shot data
[Eval batch=1635/1635] Eval on boolq/10-shot data
Ran google/flan-t5-xl eval in: 13817.477584123611 seconds

Traceback (most recent call last):
  /home/codeless/Desktop/llm-foundry/scripts/eval/eval.py:252 in <module>
      249         yaml_cfg = om.load(f)
      250     cli_cfg = om.from_cli(args_list)
      251     cfg = om.merge(yaml_cfg, cli_cfg)
    ❱ 252     main(cfg)
      253

  /home/codeless/Desktop/llm-foundry/scripts/eval/eval.py:126 in main
      123                                             model_gauntlet_df)
      124
      125         if model_gauntlet_callback is not None:
    ❱ 126             composite_scores = model_gauntlet_callback.eval_end(
      127                 None, in_memory_logger)
      128
      129         benchmark_to_taxonomy = {}

  /home/codeless/Desktop/llm-foundry/llmfoundry/callbacks/model_gauntlet_callback.py:112 in eval_end
      109             return {k: sum(v) / len(v) for k, v in results.items()}
      110
      111     def eval_end(self, state: State, logger: Logger):
    ❱ 112         new_metrics = self.compute_averages(logger)
      113         composite_scores = {}
      114         for category in self.categories:
      115             composite_scores[category['name']] = []

  /home/codeless/Desktop/llm-foundry/llmfoundry/callbacks/model_gauntlet_callback.py:92 in compute_averages
       89              'metrics/(.*?)/(\d+)-shot(/.*?)?/InContextLearning(.*)')
       90          for key in self.logger_keys:
       91              match = pat.match(key)
     ❱ 92              val = logger_data.data[key][0][1].item()
       93
       94              if match:
       95                  eval_name = match.group(1)

KeyError: 'metrics/lambada_openai/0-shot/InContextLearningLMAccuracy'
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 11800) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 11800) exited with code 1
```
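For context, the crash happens in `compute_averages`: the callback iterates over `self.logger_keys` and indexes `logger_data.data[key]` directly, so any expected metric that was never actually logged (here the 0-shot lambada_openai `InContextLearningLMAccuracy` for the T5 run) raises a KeyError. The following is only a minimal sketch of a defensive guard around that loop, written from the traceback above; it is not the repo's actual code or a proposed upstream patch.

```python
import re

# Sketch only: mirrors the loop shown in the traceback from
# llmfoundry/callbacks/model_gauntlet_callback.py (compute_averages).
PAT = re.compile(r'metrics/(.*?)/(\d+)-shot(/.*?)?/InContextLearning(.*)')


def compute_averages_with_guard(logger_keys, logger_data):
    """Average logged metric values per benchmark, skipping keys that were
    registered with the callback but never logged during eval."""
    results = {}
    for key in logger_keys:
        match = PAT.match(key)
        if match is None or key not in logger_data.data:
            # 'metrics/lambada_openai/0-shot/InContextLearningLMAccuracy'
            # would land here for the flan-t5-xl run instead of raising.
            continue
        val = logger_data.data[key][0][1].item()
        results.setdefault(match.group(1), []).append(val)
    return {k: sum(v) / len(v) for k, v in results.items()}
```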

To reproduce

I pip-installed mosaicml and the llm-foundry requirements yesterday and ran the eval.py script on a flan-t5-xl model following the quickstart guide. The only changes I made were setting max_seq_len and icl_seq_len to 512, model_name_or_path to google/flan-t5-xl, and the model name to hf_t5, in hf_eval.yaml and tasks_light.yaml.
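For concreteness, here is a rough sketch of those overrides. Only the values named above (max_seq_len and icl_seq_len set to 512, model_name_or_path set to google/flan-t5-xl, and the model name switched to hf_t5) come from this report; the surrounding structure is a guessed skeleton and the real hf_eval.yaml layout may differ.

```yaml
# Hypothetical excerpt of the edited hf_eval.yaml; field layout is approximate.
max_seq_len: 512
icl_seq_len: 512
model_name_or_path: google/flan-t5-xl

models:
  - model_name: ${model_name_or_path}
    model:
      name: hf_t5                      # changed from the default causal-LM entry
      pretrained_model_name_or_path: ${model_name_or_path}
      pretrained: true
    tokenizer:
      name: ${model_name_or_path}
```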

Expected behavior

Successful benchmarking.

Additional context

I can't figure out why the key is missing from the logger. I don't have the experience to dig into it further, so I hope this information is enough for you to track down what's wrong.
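For anyone triaging this, one way to see which metric keys actually landed in the in-memory logger (versus the key the gauntlet callback expects) is a quick check like the one below, run just before the `eval_end` call in eval.py. The `in_memory_logger` name is taken from the traceback, and the snippet assumes it is composer's `InMemoryLogger`, whose `data` attribute is a dict keyed by metric name.

```python
from composer.loggers import InMemoryLogger

EXPECTED_KEY = 'metrics/lambada_openai/0-shot/InContextLearningLMAccuracy'


def check_gauntlet_keys(in_memory_logger: InMemoryLogger) -> None:
    """Print every metric key the logger recorded and whether the key that
    raised the KeyError is among them."""
    for key in sorted(in_memory_logger.data):  # .data maps metric name -> logged values
        print(key)
    print('expected key present:', EXPECTED_KEY in in_memory_logger.data)
```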

By the way, where are the benchmark results saved?

sigjhl avatar Jul 14 '23 22:07 sigjhl

cc @bmosaicml, who worked on the evaluation code, to take a look.

hanlint avatar Jul 23 '23 16:07 hanlint