Matt Psaltis

Results 9 comments of Matt Psaltis

Thanks @akshay-anyscale. Model yamls attached below. I've tried 0.25 and 0.5 for the num_gpus_per_worker value. It definitely seems to pick it up in the early boot logs: ``` [INFO 2023-11-24...

Hey all, I also have similar updates on a fork; however, I've struggled to get feedback from the maintainers to work out how to proceed here. I similarly updated rayllm...

vLLM is simply moving too quickly, with multiple breaking changes, for ray-llm to keep up. Given that the last significant update to rayllm was three months ago, I'm not sure I can offer you...

Just sharing my experience with this issue - it seems to align with the OP's. Summary: CPU-constrained guidance means that batching can't scale correctly. vLLM: 0.4.2, Outlines: 0.0.34, lm_format_enforcer:...
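To make the "CPU-constrained guidance" point concrete, here is a minimal sketch (not vLLM's actual scheduler code; `decode_step` and `fake_guidance` are invented stand-ins) of why the cost grows with batch size: the guided-decoding logits processor runs once per sequence, per decode step, on the CPU, while the model forward pass stays batched on the GPU.

```python
# Illustrative only: per-sequence CPU guidance work scales linearly with batch size.
import time
from typing import Callable, List

import torch


def decode_step(batch_logits: torch.Tensor,
                processors: List[Callable[[List[int], torch.Tensor], torch.Tensor]],
                token_histories: List[List[int]]) -> torch.Tensor:
    """Apply one guidance processor per sequence, serially on the CPU."""
    for i, (proc, history) in enumerate(zip(processors, token_histories)):
        batch_logits[i] = proc(history, batch_logits[i])
    return batch_logits


def fake_guidance(history: List[int], scores: torch.Tensor) -> torch.Tensor:
    # Stand-in for FSM work: build a Python list of allowed ids and mask the rest.
    allowed = list(range(0, scores.numel(), 7))   # pretend ~14% of the vocab is legal
    mask = torch.full_like(scores, float("-inf"))
    mask[torch.tensor(allowed)] = 0
    return scores + mask


for batch in (1, 8, 32):
    logits = torch.randn(batch, 32_000)
    histories = [[1, 2, 3] for _ in range(batch)]
    procs = [fake_guidance] * batch
    t0 = time.perf_counter()
    decode_step(logits, procs, histories)
    print(f"batch={batch:3d}  per-step guidance cost: {time.perf_counter() - t0:.4f}s")
```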

Here are line timings for model_executor/guided_decoding/outlines_logits_processors.py:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41                                           @line_profiler.profile
    42                                           def __call__(self, input_ids: List[int],
    43                                                        scores: torch.Tensor) -> torch.Tensor:
    44...
```
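For context, this is roughly the shape of the hot path being profiled. It is a paraphrase rather than the exact outlines/vLLM source (attribute names follow the profiled lines quoted below, and `DummyFSM` is a stand-in; the real processor also advances the FSM state from the last sampled token, which is omitted here).

```python
# Rough paraphrase of the profiled __call__: ask the FSM for allowed ids (a Python
# list), convert list -> numpy -> tensor, then mask everything else.
from typing import List

import numpy as np
import torch


class GuidedLogitsProcessor:
    def __init__(self, fsm, seq_id: int):
        self.fsm = fsm                  # outlines-style FSM object
        self.fsm_state = {seq_id: 0}    # per-sequence FSM state
        self.seq_id = seq_id

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        seq_id = self.seq_id
        # 1. Which token ids are legal next? Returned as a Python list.
        allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
        # 2. List -> numpy -> tensor; this is where most of the time goes
        #    according to the line timings in the next comment.
        np_allowed_tokens = np.array(allowed_tokens, dtype=np.int32)
        allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens)
        # 3. Mask every token that is not allowed.
        mask = torch.full_like(scores, float("-inf"))
        mask[allowed_tokens_tensor.to(torch.long)] = 0
        return scores + mask


class DummyFSM:
    def allowed_token_ids(self, state: int) -> List[int]:
        return list(range(0, 32_000, 3))


proc = GuidedLogitsProcessor(DummyFSM(), seq_id=0)
out = proc([1, 2, 3], torch.randn(32_000))
```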

I've been doing some further perf analysis and breaking things out a bit to try to understand the bottleneck. It doesn't seem to be related to the indexer, but rather to moving...

```
    58     12693   9401753.9    740.7     17.2      allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
    59     12693  42707835.5   3364.7     78.1      np_allowed_tokens = np.array(allowed_tokens, dtype=np.int32)
    60     12693     73736.7      5.8      0.1      allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens)
```
Halved the cost by using...
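The "Halved the cost by using..." sentence is cut off above, so no claim here about the exact fix. But the dominant line (np.array over a large Python list) is easy to reproduce in isolation with a rough micro-benchmark; sizes below are made up.

```python
# Isolating the conversion from the FSM entirely: the expensive part is walking a
# large Python list, not wrapping an existing array as a tensor.
import timeit

import numpy as np
import torch

allowed = list(range(0, 32_000, 2))             # pretend ~16k token ids are legal
pre_converted = np.array(allowed, dtype=np.int32)

print("np.array(list) + from_numpy:",
      timeit.timeit(lambda: torch.from_numpy(np.array(allowed, dtype=np.int32)), number=1_000))
print("torch.tensor(list):         ",
      timeit.timeit(lambda: torch.tensor(allowed, dtype=torch.int32), number=1_000))
print("from_numpy(pre-built array):",
      timeit.timeit(lambda: torch.from_numpy(pre_converted), number=1_000))
```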

Beyond this, I'm not sure I see a way forward without changes to outlines and lm-format-enforcer to provide the information in a more efficient structure than a List. Does anyone...
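A hypothetical illustration of what "a more efficient structure than a List" could look like: the guidance library hands back a ready-made numpy array (or tensor) per FSM state, so the per-token hot path never walks a Python list. `ArrayBackedFSM` is an invented name, not an existing outlines or lm-format-enforcer API.

```python
# Sketch: pay the Python-list walk once per state, up front, instead of per token.
from typing import Dict, List

import numpy as np
import torch


class ArrayBackedFSM:
    """Hypothetical: the guidance library precomputes allowed ids as arrays."""

    def __init__(self, allowed_by_state: Dict[int, List[int]]):
        self._allowed = {
            state: np.asarray(ids, dtype=np.int32)
            for state, ids in allowed_by_state.items()
        }

    def allowed_token_ids(self, state: int) -> np.ndarray:
        return self._allowed[state]


fsm = ArrayBackedFSM({0: [1, 5, 9], 1: list(range(0, 32_000, 3))})
allowed = torch.from_numpy(fsm.allowed_token_ids(1))   # cheap wrap, no per-step list walk
print(allowed.shape, allowed.dtype)
```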

I went down that same line of thinking - I don't think the timings above support it, however. It's getting the Python List into a Tensor that seems to be...
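Since the comment is cut off, this is not what the thread settled on, just one illustrative direction given that finding: if FSM states repeat across decode steps, the converted tensor can be memoized per state so the list-to-tensor conversion happens once per state rather than once per token. `CachingGuidanceProcessor` and `DummyFSM` are hypothetical names.

```python
# Sketch of a caller-side workaround: cache the converted index tensor per FSM state.
from typing import Dict, List

import numpy as np
import torch


class CachingGuidanceProcessor:
    def __init__(self, fsm):
        self.fsm = fsm
        self._cache: Dict[int, torch.Tensor] = {}   # FSM state -> index tensor

    def _allowed_tensor(self, state: int) -> torch.Tensor:
        cached = self._cache.get(state)
        if cached is None:
            ids: List[int] = self.fsm.allowed_token_ids(state)
            cached = torch.from_numpy(np.array(ids, dtype=np.int64))
            self._cache[state] = cached
        return cached

    def __call__(self, state: int, scores: torch.Tensor) -> torch.Tensor:
        mask = torch.full_like(scores, float("-inf"))
        mask[self._allowed_tensor(state)] = 0
        return scores + mask


class DummyFSM:
    def allowed_token_ids(self, state: int) -> List[int]:
        return list(range(0, 32_000, 5))


proc = CachingGuidanceProcessor(DummyFSM())
scores = torch.randn(32_000)
proc(0, scores)   # first visit to state 0 pays the list -> tensor conversion
proc(0, scores)   # second visit hits the cache
```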