windreamer

Results: 22 comments of windreamer

> > After some digging, I think I have enabled response_format for the OpenAI API Server in the last commit of the PR. Maybe you can give it a try? ...

> ```python
> import json
> from typing import List
>
> from openai import OpenAI
> from pydantic import BaseModel
>
>
> class StoryOutput(BaseModel):
>     title: str
>     characters: List[str]
> ...
> ```
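To make the pattern above concrete, here is a minimal end-to-end sketch, assuming lmdeploy's OpenAI-compatible api_server is running on its default port (23333); the model name and prompt are placeholders:

```python
import json
from typing import List

from openai import OpenAI
from pydantic import BaseModel


class StoryOutput(BaseModel):
    title: str
    characters: List[str]


# Point the stock OpenAI client at the local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Ask the server to constrain generation to the StoryOutput JSON schema.
response = client.chat.completions.create(
    model="internlm2_5-7b-chat",  # placeholder: use the name your server reports
    messages=[{"role": "user", "content": "Write a short story outline."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "story_output",
            "schema": StoryOutput.model_json_schema(),
        },
    },
)

# The message content should be JSON conforming to the schema.
story = StoryOutput(**json.loads(response.choices[0].message.content))
print(story.title, story.characters)
```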

@sunskyx can you try this way? https://github.com/InternLM/lmdeploy/pull/3925#issuecomment-3252045525 Sadly, we do not have a usable ROCm environment to test and resolve this right now. @Vivicai1005 do you have any idea about it?

Can you kindly attach **the output of the following command** to help us debug?

```
lmdeploy check_env
```

By default, TP in Turbomind uses NCCL for multi-GPU communication, and this may get stuck due to an incorrect NCCL environment setup. You can go through the following checklist to help...
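As an illustration, a hedged sketch of how to surface NCCL activity before a hang; the environment variables are standard NCCL knobs, and the model path is a placeholder:

```python
import os

# Enable verbose NCCL logging so a hang points at the failing transport.
# These must be set before NCCL is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
# If the logs implicate peer-to-peer or InfiniBand transports, these
# toggles can isolate the culprit (at a performance cost):
# os.environ["NCCL_P2P_DISABLE"] = "1"
# os.environ["NCCL_IB_DISABLE"] = "1"

from lmdeploy import pipeline, TurbomindEngineConfig

# tp=2 shards the model across two GPUs; NCCL carries the traffic between them.
pipe = pipeline("internlm/internlm2_5-7b-chat",  # placeholder model path
                backend_config=TurbomindEngineConfig(tp=2))
print(pipe(["Hello"])[0].text)
```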

I have just written a simple project that builds flash_attn 3 wheels weekly. For anyone interested, you can visit https://github.com/windreamer/flash-attention3-wheels for more details. You can also install via...

We do not have this kind of device to verify, but you can build LMDeploy on Jetson in a similar way. You need to use an NVIDIA SBSA base image...

My understanding is that, for performance reasons, the streaming output the user ultimately receives is at block granularity: each emission carries the decoded result of one or more blocks. It is not the case, as you understood it, that the result of every diffusion step inside a block is returned to the user in real time. @grimoire is probably more familiar with the specifics.
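To show what this means on the client side, a minimal consumption sketch, assuming the standard OpenAI-compatible streaming API; each streamed chunk may carry the text of one or more decoded blocks rather than per-diffusion-step deltas (the base URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

stream = client.chat.completions.create(
    model="your-served-model",  # placeholder
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True,
)
# Each chunk arrives once a block (or several) has finished decoding.
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
```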

Can you elaborate a bit more on why support for LMCache is necessary? From my point of view, LMDeploy already has:

- a built-in KV cache management system...
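For reference, a hedged sketch of configuring that built-in KV cache, assuming the cache_max_entry_count and enable_prefix_caching fields present in recent TurbomindEngineConfig releases; the model path is a placeholder:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    # Fraction of free GPU memory reserved for the KV cache.
    cache_max_entry_count=0.8,
    # Reuse cached KV blocks across requests that share a prompt prefix.
    enable_prefix_caching=True,
)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)
```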

> Does this offer the ability to offload layers to CPU and have the KV cache shared efficiently? When I tried last it didn't?

In my opinion, the reason that...