lm-evaluation-harness
Evaluation Error on Scrolls Task - 2
I am trying to evaluate an LLM on long-text understanding using the SCROLLS tasks and encountered the following error. I worked around some minor bugs, but couldn't resolve this one.
First, I hit this error:
acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
                            ~~~~~~~~^~~~~~~~~~~~~~~~
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
I checked the dimensions: the results parameter had shape (3, 2), in this format:
[(-16.322248458862305, False), (-14.276965141296387, False), (-10.922859191894531, False)]
To resolve this, I added a line of code:
results = [result[0] for result in results]
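For context, a minimal sketch of what I believe is going on, reading the tuples as (loglikelihood, is_greedy) pairs per answer choice (the completion_len values and gold index below are made up for illustration):

import numpy as np

# One (loglikelihood, is_greedy) pair per answer choice, as shown above.
results = [
    (-16.322248458862305, False),
    (-14.276965141296387, False),
    (-10.922859191894531, False),
]
completion_len = np.array([12, 9, 15])  # illustrative per-choice lengths
gold = 0  # illustrative index of the correct choice

# np.asarray(results) has shape (3, 2), so dividing by a (3,) array fails:
#   np.argmax(results / completion_len)
#   ValueError: operands could not be broadcast together with shapes (3,2) (3,)

# Keeping only the log-likelihoods gives the intended (3,) array:
lls = np.array([ll for ll, _ in results])
acc_norm = 1.0 if np.argmax(lls / completion_len) == gold else 0.0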
But after this, I faced another error, which I couldn't figure out.
Traceback (most recent call last):
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/__main__.py", line 207, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 152, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/api/metrics.py", line 206, in mean_stderr
return sample_stddev(arr) / math.sqrt(len(arr))
^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/api/metrics.py", line 202, in sample_stddev
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
ZeroDivisionError: float division by zero
Are you using --limit 1 for the second error? That might be because the sample standard deviation divides by N - 1.
cc @lintangsutawika
Yeah, division by zero looks like an error from using only 1 sample. I should patch that.
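Something along these lines would avoid the crash when only a single sample is evaluated (just a sketch of the guard; the body below is reconstructed from the traceback rather than copied from metrics.py):

import math

def sample_stddev(arr):
    # With N == 1 the unbiased estimator divides by N - 1 == 0,
    # which is exactly what --limit 1 triggers here.
    if len(arr) <= 1:
        return 0.0  # or float("nan"), depending on how stderr should be reported
    mu = sum(arr) / len(arr)
    return math.sqrt(sum((x - mu) ** 2 for x in arr) / (len(arr) - 1))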
Yes, thanks. Using limit 1 was indeed the problem. That error is gone, but now I have encountered another one.
Traceback (most recent call last):
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/__main__.py", line 207, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 152, in simple_evaluate
results = evaluate(
^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/evaluator.py", line 482, in evaluate
results[task_name][metric_key] = agg_fn(items)
^^^^^^^^^^^^^
File "/home/user1/LLM-BenchMarking/lm-evaluation-harness/lm_eval/tasks/scrolls/task.py", line 197, in compute_metrics
computed = self.metric.compute(
^^^^^^^^^^^
AttributeError: 'GovReport' object has no attribute 'metric'
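I haven't tracked down where self.metric is supposed to be set, but for what it's worth, a small hypothetical helper like the one below, called at the top of compute_metrics, would at least surface the load failure more clearly than this AttributeError (the helper name and approach are my own, not anything from task.py):

def _require_metric(task):
    # Hypothetical helper: fail with a clearer message than the AttributeError
    # above if the SCROLLS metric never got attached to the task instance.
    metric = getattr(task, "metric", None)
    if metric is None:
        raise RuntimeError(
            f"{type(task).__name__}: metric was never loaded; "
            "the metric download/initialization step may have failed."
        )
    return metric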
Will take a look
Sorry, reopening this as it turns out there is still an issue in loading the metric.