Eval-running often hangs on last sample
Describe the bug
Relatively often, my eval run will reach, say, sample 199/200 and then hang on the last one for a very long time. It isn't clear to me why this occurs, but it can persist for an hour or more, at which point I generally terminate the command from my CLI and try again.
To Reproduce
Unfortunately, I'm not sure how to make this happen every time. It does seem more likely to happen on bigger sampling runs than on small ones, though.
Code snippets
No response
OS
macOS
Python version
Python v3.11
Library version
latest
Strangely, even after KeyboardInterrupt, it often takes a while for my terminal to regain the ability to run normal commands after this error occurs. Not sure if that helps to pin down the problem.
I also have this issue. It is not about rate limits: it happens even on datasets that are definitely below the tokens-per-minute and requests-per-minute rate limits. However, it does only seem to show up for large datasets.
An example of the error trace when I press Ctrl+C twice to exit after it has been stuck for a long time:
Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 856, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  [...]
  File "/home/lrudl/[...]/evals/evals/cli/oaieval.py", line 223, in run
    result = eval.run(recorder)
  File "/home/lrudl/[...]/evals/evals/elsuite/modelgraded/classify.py", line 107, in run
    self.eval_all_samples(recorder, samples)
  File "/home/lrudl/[...]/evals/evals/eval.py", line 146, in eval_all_samples
    idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 861, in next
    self._cond.wait(timeout)
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1537, in _shutdown
    atexit_call()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt:
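For what it's worth, the first trace shows the main thread parked in IMapIterator.next, waiting on a condition variable for a result that never arrives. Here is a minimal sketch (not the evals code; the names and the stuck sample index are illustrative) that reproduces the same blocking pattern:

from multiprocessing.pool import ThreadPool
import time

def eval_sample(idx):
    # Stand-in for an API call that never returns on one sample.
    if idx == 199:
        time.sleep(10**9)
    return idx

with ThreadPool(10) as pool:
    # Blocks indefinitely at 199/200, matching the hang described above.
    results = list(pool.imap_unordered(eval_sample, range(200)))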
Often all I need to do is try again a few times and it eventually runs all the way to completion, but: (1) this massively increases the token cost, and (2) it makes it difficult to efficiently run many evals in sequence with a script, because you need to manually supervise the run and get it unstuck many times. This is a major time cost for big eval projects.
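In the meantime, the manual supervision can at least be scripted away. A rough retry wrapper (a sketch only; the command, attempt count, and one-hour budget are all placeholder assumptions):

import subprocess

cmd = ["oaieval", "completion_fn", "eval_name"]  # placeholder command
for attempt in range(5):
    try:
        subprocess.run(cmd, timeout=3600)  # kill the run if it exceeds 1h
        break  # run finished; stop retrying
    except subprocess.TimeoutExpired:
        print(f"run hung; retrying (attempt {attempt + 1}/5)")

Note this does nothing for point (1): every retry still re-spends tokens on samples that already ran.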
It seems that this issue is influenced by a bug in tqdm, as discussed at https://github.com/tqdm/tqdm/issues/627. Applying the following patch significantly improved the situation.
diff -urN a/.venv/lib/python3.11/site-packages/evals/eval.py b/.venv/lib/python3.11/site-packages/evals/eval.py
--- a/.venv/lib/python3.11/site-packages/evals/eval.py 2023-11-29 12:55:58.214648049 +0900
+++ b/.venv/lib/python3.11/site-packages/evals/eval.py 2023-11-29 12:56:05.630671841 +0900
@@ -143,7 +143,8 @@
             else:
                 logger.info(f"Running in threaded mode with {threads} threads!")
                 iter = pool.imap_unordered(eval_sample, work_items)
-                idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+                # idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+                idx_and_result = list(iter)
         return [r for _, r in sorted(idx_and_result)]
 
     def get_samples(self):
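If you want to keep the progress bar rather than dropping tqdm entirely, another option is to drive tqdm manually instead of handing it the pool iterator, which sidesteps the iterator-wrapping behavior discussed in the tqdm issue. A sketch under the same assumptions as the patch above (i.e. inside eval_all_samples, with pool, eval_sample, work_items, and show_progress in scope):

from tqdm import tqdm

iter = pool.imap_unordered(eval_sample, work_items)
idx_and_result = []
with tqdm(total=len(work_items), disable=not show_progress) as pbar:
    for item in iter:
        idx_and_result.append(item)
        pbar.update(1)  # advance the bar once per completed sample

This keeps the progress display while avoiding tqdm's __iter__ over the pool iterator, which is where the trace above sits when the hang occurs.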
I also had this issue. A workaround I found is to set the EVALS_THREADS_TIMEOUT environment variable when running the command. It specifies the time allowed for each input to the model to run. It can be used as follows:
EVALS_THREADS_TIMEOUT=20 oaieval completion_fn eval_name
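I don't know the exact mechanism evals uses for this variable, but the traceback above bottoms out in IMapIterator.next, and that method accepts a timeout directly. A hedged sketch of how a per-result timeout of this kind can work (assuming pool, eval_sample, and work_items as in eval.py), so that a stuck sample raises instead of hanging forever:

import multiprocessing

# Illustrative only: pull results one at a time with a 20s timeout,
# mirroring EVALS_THREADS_TIMEOUT=20.
it = pool.imap_unordered(eval_sample, work_items)
results = []
while True:
    try:
        results.append(it.next(timeout=20))
    except StopIteration:
        break  # all samples done
    except multiprocessing.TimeoutError:
        print("a sample timed out; handle or retry it instead of hanging")
        break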