shixianc
@ptarasiewiczNV Thank you for the reply. Regarding (1), it would be nice to have that, as some of our models are small enough (they can be loaded on a 16GB GPU)...
Hi, is there any update on this feature? It would be quite useful for loading large LLMs from S3.
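In case it helps others, here's a minimal sketch of the workaround we use today: sync the checkpoint from S3 to local disk first, then point vLLM at the local path. The bucket path and model directory are placeholders.

```python
# Sketch of the S3 workaround: download the weights, then load locally.
# Bucket/prefix are hypothetical; requires the AWS CLI to be configured.
import subprocess
from vllm import LLM

local_dir = "/tmp/my-model"

# `aws s3 sync` copies the HF-format checkpoint
# (config.json, tokenizer files, *.safetensors, ...) to local disk.
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/models/my-model/", local_dir],
    check=True,
)

# Load from the local copy as usual.
llm = LLM(model=local_dir)
```

Native streaming from S3 would obviously be nicer, since this doubles the disk footprint and adds a cold-start delay.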
The automatic prefix caching commit seems to have been merged very recently and is labeled for the 0.3.4 release, so I assume some of the changes are not available on 0.3.3. Update: actually, I just tested that...
@robertgshaw2-neuralmagic thanks, we're really looking forward to the optimization! Also, could you clarify the behavior of this feature: 1. in the same batch, the first N tokens of the requests...
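To make the question concrete, here is a sketch of the scenario I mean (the model name is a placeholder, and I'm assuming the `enable_prefix_caching` flag from the merged PR, which may only exist on 0.3.4+):

```python
# Sketch: several requests in one batch sharing a long common prefix.
# With automatic prefix caching enabled, the KV blocks for the shared
# prefix should (as I understand it) be computed once and reused.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    enable_prefix_caching=True,
)

shared_prefix = "You are a helpful assistant. " * 50  # long common prefix
questions = ["What is a KV cache?", "What is paged attention?"]
prompts = [shared_prefix + q for q in questions]

outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
```

Is reuse guaranteed within the same batch like this, or only across batches once the prefix blocks have been cached?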
Do we have an ETA? 😊
Is anyone able to run it on 4 A10 GPUs? 4 × 24 GB = 96 GB
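This is roughly what I tried, for reference (a sketch; the model name is a placeholder): shard the model across the 4 A10s with tensor parallelism.

```python
# Sketch: tensor parallelism across 4 A10s (24 GB each).
from vllm import LLM

llm = LLM(
    model="my-org/my-large-model",  # placeholder
    tensor_parallel_size=4,         # one shard per A10
    gpu_memory_utilization=0.90,    # leave some headroom per GPU
)
```

Even with 96 GB total, keep in mind the KV cache also needs room, so the usable budget for weights is noticeably less than 96 GB.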
Hi, is there an update on this?
@symphonylyh Thanks for the update! Starting with (3) would unblock our team. May I assume this would also support classic dynamic batching?