offline_inference.py fails with: append_paged_kv_cache() missing 1 required positional argument: 'kv_last_page_len'
When I try to run offline_inference.py, I get the following error:
(RayWorker pid=2597779) Traceback (most recent call last): [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/utils/threading_utils.py", line 27, in wrapper [repeated 3x across cluster]
(RayWorker pid=2597779)     return func(*args, **kwargs) [repeated 6x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/worker/pipeline_parallel_worker.py", line 67, in _execution_loop [repeated 3x across cluster]
(RayWorker pid=2597779)     output = self.execute_model(step_inputs.scheduler_outputs) [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/worker/base_worker.py", line 179, in execute_model [repeated 3x across cluster]
(RayWorker pid=2597779)     sampler_outputs = self.model_runner.run( [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/model_executor/model_runner.py", line 239, in run [repeated 3x across cluster]
(RayWorker pid=2597779)     output = self.model( [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [repeated 12x across cluster]
(RayWorker pid=2597779)     return self._call_impl(*args, **kwargs) [repeated 12x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [repeated 12x across cluster]
(RayWorker pid=2597779)     return forward_call(*args, **kwargs) [repeated 12x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/model_executor/models/llama.py", line 179, in forward [repeated 12x across cluster]
(RayWorker pid=2597779)     hidden_states = self.model(hidden_states, positions, attention_backend_wrapper) [repeated 3x across cluster]
(RayWorker pid=2597779)     hidden_states = layer( [repeated 3x across cluster]
(RayWorker pid=2597779)     hidden_states = self.self_attn( [repeated 3x across cluster]
(RayWorker pid=2597779)     attn_output = attention_backend_wrapper.forward( [repeated 3x across cluster]
(RayWorker pid=2597779)   File "/home/yujy/sarathi-serve/sarathi/model_executor/attention/flashinfer_attention_wrapper.py", line 237, in forward [repeated 3x across cluster]
(RayWorker pid=2597779)     append_paged_kv_cache( [repeated 3x across cluster]
(RayWorker pid=2597779) TypeError: append_paged_kv_cache() missing 1 required positional argument: 'kv_last_page_len' [repeated 3x across cluster]
By the way, the model I'm using is Llama2-7b. Could the problem be caused by switching models?
I am facing the same issue.
The issue is in the flashinfer code: they refactored it after v0.1.5, so using an older version of flashinfer will work.
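If you want to confirm this before reinstalling anything, you can check which flashinfer version is installed. Below is a minimal sketch; the distribution name "flashinfer" and the 0.1.5 cutoff are assumptions taken from this thread, not from the sarathi-serve code, so adjust them if your setup differs.

```python
# Minimal sketch: warn if the installed flashinfer is newer than the last
# version reported to work in this thread.
# Assumption: the pip distribution is named "flashinfer" and 0.1.5 is the cutoff.
from importlib.metadata import PackageNotFoundError, version


def parse(v: str) -> tuple[int, ...]:
    # Drop local build tags like "+cu121torch2.3" and compare numeric parts only.
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())


def check_flashinfer(max_ok: str = "0.1.5") -> None:
    try:
        installed = version("flashinfer")
    except PackageNotFoundError:
        print("flashinfer is not installed")
        return
    if parse(installed) > parse(max_ok):
        print(f"flashinfer {installed} is newer than {max_ok}; "
              "append_paged_kv_cache() was refactored and raises this TypeError.")
    else:
        print(f"flashinfer {installed} should be fine.")


if __name__ == "__main__":
    check_flashinfer()
```

If the check reports a newer version, downgrading flashinfer to 0.1.5 (built against your CUDA and torch versions) should make append_paged_kv_cache() match the way sarathi-serve calls it.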
Hi, I also ran into this issue when running the code in the examples folder. Did you find a fix for it?
I think it's fixed by downgrading flashinfer to 0.1.5, as in today's commit. Thank you!