leiwen83

Results 39 comments of leiwen83

> @leiwen83: I solved this error by adding the following line of code above line 115 in `huggingface_loader.py`: `preshard_funcs = {}`

Yep, with this, "argument of type 'NoneType' is not...
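The error being discussed is the standard Python failure when a membership test hits a `None` value. A minimal sketch of that failure mode (the variable name `preshard_funcs` comes from the comment above; the key string is an invented placeholder, not the actual `huggingface_loader.py` code):

```python
# A membership test on None raises TypeError; the quoted one-line fix
# initializes the variable to an empty dict so the test returns False instead.
preshard_funcs = None

try:
    _ = "some.weight.name" in preshard_funcs  # membership test on None
except TypeError as e:
    print(e)  # argument of type 'NoneType' is not iterable

preshard_funcs = {}  # the suggested fix
assert "some.weight.name" not in preshard_funcs  # no exception, just False
```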

> @leiwen83 Fast tokenizer is needed to get high performance.

There are many new models that don't support a fast tokenizer, so for those models, and for models fine-tuned without a tokenizer.json,...

> @leiwen83 What GPU card and setting did you use for testing? When using a slow tokenizer, LightLLM should not be slower than vLLM either.

I am testing llama7B with...

It could be fixed by the change below:

```
diff --git a/lightllm/common/basemodel/layer_weights/hf_load_utils.py b/lightllm/common/basemodel/layer_weights/hf_load_utils.py
index 30be3a5..d9ef3ad 100644
--- a/lightllm/common/basemodel/layer_weights/hf_load_utils.py
+++ b/lightllm/common/basemodel/layer_weights/hf_load_utils.py
@@ -15,7 +15,8 @@ def load_hf_weights(data_type, weight_dir, pre_post_layer=None, transformer_laye
     candidate_files =...
```

My main request here is to achieve quantization results consistent with the official ones. Which calibration dataset do the officially released GPTQ and AWQ versions use? Can users reproduce the quantization results locally?

Currently this implementation still has two kernels dealing with shrink and expand separately. I wonder whether we could merge them into one, so that Triton could do the pipeline autotune...
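At the math level, the shrink/expand pair being discussed is two chained projections (the usual LoRA pattern: down-project to a small rank, then up-project back). A minimal NumPy sketch of what a merged kernel would have to compute, where the shapes and names are illustrative assumptions rather than the actual Triton code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 64, 8, 4                  # hidden size, low rank, batch size
x = rng.standard_normal((n, d))
A = rng.standard_normal((d, r))     # "shrink" weight: d -> r
B = rng.standard_normal((r, d))     # "expand" weight: r -> d

# Two separate kernels: the (n, r) intermediate is materialized in memory
# between the two launches.
h = x @ A            # shrink
y_two_pass = h @ B   # expand

# A fused kernel would produce the same result in one pass, keeping the
# intermediate on-chip; mathematically y = x @ A @ B either way.
y_fused = x @ (A @ B)

assert np.allclose(y_two_pass, y_fused)
```

The payoff of fusing is avoiding the round trip of the `(n, r)` intermediate through global memory, which is also what would give Triton's autotuner a single pipeline to schedule.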

Sounds very interesting! For the second usage, I have a question:

```
The user want to query a fixed set of long documents (examples: software manual, internal documents, etc). In...
```

Hi @nsurbay, thanks for your reply! Here are the scripts and functions I am trying to play with: there are two functions, say FunA and FunB, which are interpreted...

I notice there is a note in the project's README:

```
A current limitation is that QBDI doesn't handle signals, multithreading (it doesn't deal with new threads creation) and C++ exception...
```