
Fix metrics and allow disable block manager v2

Open Nero10578 opened this issue 8 months ago • 1 comment

The latest block manager v2 can cause max recursion depth errors with some models and certain sampler combinations. The workaround is to use block manager v1, but the arg parser did not allow this. This change makes it possible to select block manager v1 by passing the argument --use-v2-block-manager false
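As a rough illustration of the parser side of this, the sketch below shows one way an argparse option can accept an explicit true/false value. It is not the Aphrodite arg parser: the helper names `str_to_bool` and `build_parser` are hypothetical, and the default of True is an assumption based on v2 being the behavior that could not be turned off.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse a command-line string into a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with a single boolean-valued option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--use-v2-block-manager",
        type=str_to_bool,
        nargs="?",      # the value is optional
        const=True,     # bare flag with no value means True
        default=True,   # assumed default: v2 enabled
    )
    return parser


args = build_parser().parse_args(["--use-v2-block-manager", "false"])
print(args.use_v2_block_manager)
```

With this shape, omitting the flag or passing it bare keeps v2 enabled, while an explicit false value selects v1.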

Separately, using chunked prefill and LoRA with the old metrics instead of request-level metrics can cause this error:

ERROR:    Engine background task failed
Exception in callback functools.partial(<function _log_task_completion at 0x7e38ccf05d00>, error_callback=<bound method AsyncAphrodite._error_callback of <aphrodite.engine.async_aphrodite.AsyncAphrodite object at 0x7e38c8a91cd0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7e38ccf05d00>, error_callback=<bound method AsyncAphrodite._error_callback of <aphrodite.engine.async_aphrodite.AsyncAphrodite object at 0x7e38c8a91cd0>>)>
Traceback (most recent call last):
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 54, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 809, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 735, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 391, in step_async
    self.do_log_stats(scheduler_outputs, outputs)
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1421, in do_log_stats
    loggers.log(stats)
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/metrics.py", line 559, in log
    self._log_prometheus(stats)
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/metrics.py", line 510, in _log_prometheus
    self._log_counter(self.metrics.counter_generation_tokens,
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/metrics.py", line 471, in _log_counter
    counter.labels(**self.labels).inc(data)
  File "/home/arli/miniconda3/envs/aphrodite/lib/python3.11/site-packages/prometheus_client/metrics.py", line 313, in inc
    raise ValueError('Counters can only be incremented by non-negative amounts.')
ValueError: Counters can only be incremented by non-negative amounts.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/arli/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 66, in _log_task_completion
    raise AsyncEngineDeadError(
aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
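The traceback shows the engine dying because a negative value reached a Prometheus counter. The sketch below illustrates one defensive pattern for that situation: skip the increment when the value is negative. It is not necessarily what this PR does. `StubCounter` is a stand-in that only mimics the non-negative check seen in the traceback, and `log_counter_safely` is a hypothetical helper name.

```python
class StubCounter:
    """Stand-in counter that rejects negative increments, like the traceback shows."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts.")
        self.value += amount


def log_counter_safely(counter, data: float) -> bool:
    """Increment the counter only for non-negative data.

    Returns True if the counter was incremented, False if the value was
    skipped, so the caller can decide whether to emit a warning.
    """
    if data < 0:
        return False
    counter.inc(data)
    return True


counter = StubCounter()
log_counter_safely(counter, 5)   # counted
log_counter_safely(counter, -3)  # skipped instead of raising
print(counter.value)
```

A guard like this keeps the engine loop alive, but it hides the underlying miscount, so it is a stopgap rather than a replacement for computing the token statistics correctly.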

Nero10578, Apr 15 '25 05:04

Can you limit this PR to just the Metrics fix? Block Manager V1 is going to be deprecated as of #1300 so the other part of this PR will conflict with the changes.

AlpinDale, Apr 16 '25 10:04