Venkat Raman

23 comments by Venkat Raman

Hello @rstojnic, Hope you are doing well. Looking forward to your PR review.

@navneet1v Nice! I was looking for this and came across your PR. I would like to propose making k-NN search an optional feature that could be controlled via...

@navneet1v Thanks for your quick response. Your approach sounds good 👍🏽

@vamshin @navneet1v I'm not from OpenAI / a maintainer of this repo. I'm an OpenSearch community user as well and was trying out your change while it is pending review. (Approved to not...

Hey @martinigoyanes, just taking a stab at the issue here. Without knowing the actual model config, as per [this](https://www.anyscale.com/blog/continuous-batching-llm-inference) article: `The amount of GPU memory consumed scales with the base...
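
To make that concrete, here is a rough back-of-the-envelope sketch of how the KV-cache part grows with sequence length and batch size. The model shape below (40 layers, 40 heads, head dim 128, fp16, roughly 13B-class) is an illustrative assumption, not your actual config:

```python
# Back-of-the-envelope KV-cache sizing; all model dimensions are illustrative.
def kv_cache_bytes_per_token(num_layers=40, num_heads=40, head_dim=128, dtype_bytes=2):
    # Each layer stores one key and one value vector per head for every token.
    return 2 * num_layers * num_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()
print(f"KV cache per token: {per_token / 1024**2:.2f} MiB")   # ~0.78 MiB

# With 2048-token sequences and 32 concurrent requests, the cache alone needs:
seq_len, batch = 2048, 32
print(f"KV cache for the batch: {per_token * seq_len * batch / 1024**3:.0f} GiB")  # ~50 GiB
```

So the VRAM bill is the (static) model weights plus a KV cache that scales with `batch_size * seq_len`, which is exactly why continuous batching focuses on packing the cache efficiently.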

Good question. In transformer inference, both the `prefill` & `decode` phases use GPU VRAM, i.e., processing is done by moving model weights, the KV cache, embeddings, etc., from VRAM to L2, L1 &...
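
A minimal sketch of the two phases with Hugging Face `transformers` (the model name is just a placeholder). Both phases keep the weights and the KV cache resident in VRAM; prefill is one large compute-bound pass over the prompt, while each decode step re-reads the weights and the growing cache:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().eval()

prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # Prefill: one pass over the whole prompt; populates the KV cache in VRAM.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(-1)

    # Decode: one step per generated token; each step streams weights + the
    # growing KV cache out of VRAM, which is why this phase is bandwidth-bound.
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
```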

No problem @martinigoyanes. vLLM supports this: https://github.com/vllm-project/vllm/issues/2304 Maybe TGI already supports this natively or through vLLM integration? I have to look into the TGI config to get more clarity on...
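
In vLLM that is just a constructor argument, something like the sketch below (model name and GPU count are placeholders; whether TGI exposes an equivalent knob is what I still need to check):

```python
from vllm import LLM, SamplingParams

# Shard the weights across 2 GPUs even if the model would fit on one,
# trading some communication overhead for more VRAM headroom per GPU.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # example model
    tensor_parallel_size=2,
)

params = SamplingParams(max_tokens=64)
out = llm.generate(["Explain tensor parallelism in one line."], params)
print(out[0].outputs[0].text)
```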

> I think TGI supports TP when model does not fit on 1 GPU but it does not allow you to force it to happen even when model fits in...

Hey @martinigoyanes,

> you can leverage 100% VRAM from the extra GPU

Maybe there is a terminology/communication gap here. The above statement is not correct. Higher throughput is achieved...

> When serving LLMs for "real" use cases, you must put some kind of rate limiter in front of it

`vLLM`, `TGI` & `Triton` are already powering several "real" use...
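
That said, if someone does want an explicit guardrail in front of the server, it does not need to be heavyweight; a hypothetical client-side limiter could be as small as this sketch (the `/generate` endpoint and payload shape are assumptions, loosely TGI-style; adjust for your server):

```python
import asyncio
import aiohttp

MAX_IN_FLIGHT = 8                      # illustrative cap on concurrent requests
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def generate(session, prompt):
    async with sem:  # at most MAX_IN_FLIGHT requests leave this client at once
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
        async with session.post("http://localhost:8080/generate", json=payload) as resp:
            return await resp.json()

async def main(prompts):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate(session, p) for p in prompts))

# asyncio.run(main(["What is continuous batching?"] * 100))
```

The serving engines themselves already queue and schedule whatever gets through, so this is a belt-and-braces measure rather than a requirement.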