zxy
> hi, I met the same problem, in my case, my tensor variables are not in the same device, that's the problem, after I fixed the tensor variables to the...
I also encountered similar issues. When I test the generation throughput on LongBench using the script "scripts/long_test.sh" with the following parameters: ``` gpuid=0 model='JackFram/llama-160m' quant_method='kivi' k_bits=2 v_bits=2 group_size=32 residual_length=128 e=0...
> @CUHKSZzxy @lichongod > > Thank you guys for the detailed benchmark. We also notice it in our previous experiments. The longbench is tested under the batch size == 1...
@Amber-Believe You can refer to the code below and increase the `max_new_tokens` parameter in the generation config; `max_new_tokens` controls the upper bound on the output length. ``` import os from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig from lmdeploy.vl import load_image os.environ['CUDA_VISIBLE_DEVICES'] = '7' model_path = "xxx" # Configure the...
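To illustrate what `max_new_tokens` does, here is a toy decode loop (a sketch, not lmdeploy internals): the parameter is a hard cap on how many tokens may be emitted, so generation stops at whichever comes first, an end-of-sequence token or the cap. `next_token_fn` and the `EOS` id are hypothetical stand-ins for the model's sampler and tokenizer.

```python
EOS = 0  # hypothetical end-of-sequence token id

def generate(next_token_fn, max_new_tokens):
    """Greedy decode loop bounded by max_new_tokens."""
    out = []
    for _ in range(max_new_tokens):
        tok = next_token_fn()
        if tok == EOS:
            break  # model finished early
        out.append(tok)
    return out

# A "model" that never emits EOS: the output length hits the cap exactly,
# which is why a too-small max_new_tokens silently truncates long answers.
tokens = generate(lambda: 1, max_new_tokens=5)
print(len(tokens))  # 5
```

Raising `max_new_tokens` in `GenerationConfig` only raises this cap; it does not force the model to produce longer outputs.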
Thanks again for your dedicated contributions. While testing the functionality, how are we expected to use this feature? Currently, I use the following commands after launching the API...
Failed with `tp1, dp 32, ep 32`, error info: ``` 2025-04-11 04:15:17,172 - lmdeploy - [37mINFO[0m - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=0, dp=32, dp_rank=0, ep=32, session_len=None, max_batch_size=128, cache_max_entry_count=0.8, prefill_interval=16,...
@Juniper1021 The error trace is similar to https://github.com/InternLM/lmdeploy/issues/3343. Could you try ``` pip install nvidia-cublas-cu12==12.4.5.8 ```
> [@CUHKSZzxy](https://github.com/CUHKSZzxy) Can Qwen3-235B-A22B-FP8 be used with DP and EP? If so, can you provide an example? 1. You should use TP4 rather than TP8 for Qwen3-235B-A22B-FP8; the weights are...
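One plausible reason TP4 works where TP8 does not (a hedged sketch, not confirmed from this thread): block-quantized FP8 weights are commonly quantized in 128-wide blocks, and each tensor-parallel shard must contain a whole number of blocks. The block size of 128 and the MoE intermediate width of 1536 below are assumptions for illustration.

```python
BLOCK = 128          # assumed FP8 quantization block width
INTERMEDIATE = 1536  # assumed MoE intermediate size

def shard_ok(dim, tp, block=BLOCK):
    """A TP split is valid only if every shard is a whole number of quant blocks."""
    if dim % tp != 0:
        return False
    return (dim // tp) % block == 0

print(shard_ok(INTERMEDIATE, 4))  # True: 1536/4 = 384 = 3 blocks of 128
print(shard_ok(INTERMEDIATE, 8))  # False: 1536/8 = 192 is not a multiple of 128
```

Under these assumed sizes, TP8 would leave each shard with a fractional quantization block, which a loader typically rejects at weight-loading time.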
For InternVL3.5 FP8 quantization, you can try this PR: https://github.com/InternLM/lmdeploy/pull/4018. Other quantization methods may currently have compatibility issues; we have not explicitly verified whether they work.
Thanks for pointing this out and providing your solution. This should be fixed in the above PR.