lm-evaluation-harness
Evaluation Error on Scrolls Task - 2
I am trying to evaluate an LLM on long-text understanding using the SCROLLS tasks and encountered the following error. I worked around some minor bugs, but couldn't resolve this one.
First, I hit this error:
acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
                            ~~~~~~~~^~~~~~~~~~~~~~~~
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
I checked the dimensions: the results parameter had shape (3, 2), in this format:
[(-16.322248458862305, False), (-14.276965141296387, False), (-10.922859191894531, False)]
To resolve this, I added a line of code:
results = [result[0] for result in results]
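For context, a minimal sketch of what I believe is going on, reading the tuples as (loglikelihood, is_greedy) pairs per answer choice (the completion_len values and gold index below are made up for illustration):

import numpy as np

# One (loglikelihood, is_greedy) pair per answer choice, as shown above.
results = [
    (-16.322248458862305, False),
    (-14.276965141296387, False),
    (-10.922859191894531, False),
]
completion_len = np.array([12, 9, 15])  # illustrative per-choice lengths
gold = 0  # illustrative index of the correct choice

# np.asarray(results) has shape (3, 2), so dividing by a (3,) array fails:
#   np.argmax(results / completion_len)
#   ValueError: operands could not be broadcast together with shapes (3,2) (3,)

# Keeping only the log-likelihoods gives the intended (3,) array:
lls = np.array([ll for ll, _ in results])
acc_norm = 1.0 if np.argmax(lls / completion_len) == gold else 0.0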
But after this, I faced another error, which I couldn't figure out.
Traceback (most recent call last):
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/__main__.py", line 207, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 152, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/api/metrics.py", line 206, in mean_stderr
return sample_stddev(arr) / math.sqrt(len(arr))
^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/api/metrics.py", line 202, in sample_stddev
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero
Are you using --limit 1 for the second error? That might be because the sample standard deviation divides by N - 1.
cc @lintangsutawika
Yeah, division by zero looks like an error from using only 1 sample. I should patch that.
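Something along these lines would avoid the crash when only a single sample is evaluated (just a sketch of the guard; the body below is reconstructed from the traceback rather than copied from metrics.py):

import math

def sample_stddev(arr):
    # With N == 1 the unbiased estimator divides by N - 1 == 0,
    # which is exactly what --limit 1 triggers here.
    if len(arr) <= 1:
        return 0.0  # or float("nan"), depending on how stderr should be reported
    mu = sum(arr) / len(arr)
    return math.sqrt(sum((x - mu) ** 2 for x in arr) / (len(arr) - 1))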
Yes, thanks. Using limit 1 was indeed the problem. That error is gone, but now I have encountered another one.
Traceback (most recent call last):
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/__main__.py", line 207, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 152, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 482, in evaluate
results[task_name][metric_key] = agg_fn(items)
^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 197, in compute_metrics
computed = self.metric.compute(
^^^^^^^^^^^
AttributeError: 'GovReport' object has no attribute 'metric'
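I haven't tracked down where self.metric is supposed to be set, but for what it's worth, a small hypothetical helper like the one below, called at the top of compute_metrics, would at least surface the load failure more clearly than this AttributeError (the helper name and approach are my own, not anything from task.py):

def _require_metric(task):
    # Hypothetical helper: fail with a clearer message than the AttributeError
    # above if the SCROLLS metric never got attached to the task instance.
    metric = getattr(task, "metric", None)
    if metric is None:
        raise RuntimeError(
            f"{type(task).__name__}: metric was never loaded; "
            "the metric download/initialization step may have failed."
        )
    return metric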
Will take a look
Sorry, reopening this as it turns out there is still an issue in loading the metric.