Results: 10 comments of shixianc

@ptarasiewiczNV Thank you for the reply. Regarding 1, it would be nice to have that, as some of our models are small enough (they can be loaded on a 16GB GPU)...

Hi, is there any update on this feature? It would be quite useful for loading large LLMs from S3.
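Until native S3 loading lands, a common workaround is to stage the weights on local disk first and point vLLM at the local copy. A minimal sketch, assuming boto3 and a hypothetical bucket/prefix:

```python
# Workaround sketch: stage S3-hosted weights on local disk, then load with vLLM.
# BUCKET/PREFIX are hypothetical; replace with the real location of the weights.
import os

import boto3
from vllm import LLM

BUCKET = "my-model-bucket"
PREFIX = "llama-2-7b/"          # key prefix holding weights, tokenizer, config
LOCAL_DIR = "/tmp/llama-2-7b"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):   # skip "directory" placeholder keys
            continue
        dest = os.path.join(LOCAL_DIR, os.path.relpath(key, PREFIX))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, key, dest)

llm = LLM(model=LOCAL_DIR)      # vLLM reads the staged local copy
```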

The automatic prefix caching commit seems to have been merged very recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available on 0.3.3. Update: actually, I just tested that...
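If that is right, on 0.3.4 and later the feature can be exercised through the opt-in engine argument that shipped with it. A minimal sketch (the model id is a placeholder):

```python
# Minimal sketch of turning on automatic prefix caching (assumes vllm >= 0.3.4,
# where the feature landed; the model id is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,   # opt-in flag for automatic prefix caching
)
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```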

@robertgshaw2-neuralmagic thanks, we're really looking forward to the optimization! Also, could you clarify the behavior of this feature: 1. in the same batch, the first N tokens of the requests...
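For concreteness, this is the kind of batch that question 1 is asking about: every request starts with the same first-N-token prefix. A sketch, assuming vLLM's offline LLM API and hypothetical prompts:

```python
# Sketch of the scenario in question 1: one batch whose requests all share the
# same first-N-token prefix (prompts and model id are hypothetical).
from vllm import LLM, SamplingParams

SHARED_PREFIX = "You are a helpful assistant. Answer concisely.\n\n"
questions = [
    "What is the capital of France?",
    "What is the capital of Japan?",
    "What is the capital of Peru?",
]
prompts = [SHARED_PREFIX + q for q in questions]

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
# If prefix caching applies within a batch, the KV cache for SHARED_PREFIX is
# computed once and reused by all three requests; whether it does is exactly
# what the comment above is asking.
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
```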

Is anyone able to run it on 4 A10 GPUs? 4 × 24 GB = 96 GB
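For reference, a sketch of what running it on the 4 A10s would look like with vLLM's tensor parallelism; the model id is hypothetical, and whether it fits in the aggregate 96 GB depends on dtype, quantization, and KV-cache headroom:

```python
# Sketch: shard a model across the 4 A10s with vLLM tensor parallelism
# (4 x 24 GB = 96 GB aggregate). The model id is hypothetical; whether the
# weights plus KV cache fit in 96 GB depends on dtype and quantization.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,   # one shard per A10
)
```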

Hi, is there an update on this?

@symphonylyh Thanks for the update! Starting with (3) would unblock our team. May I assume this would also support classic dynamic batching?