### Model Series
Qwen3

### What are the models used?
qwen3-4b

### What is the scenario where the problem happened?
performance: qwen3-4b vs qwen2.5-7b-instruct

### Is this a known issue?...
### Required prerequisites
- [x] I have read the documentation.
- [x] I have searched the [Issue Tracker](https://github.com/PKU-Alignment/align-anything/issues) and [Discussions](https://github.com/PKU-Alignment/align-anything/discussions) and confirmed that this hasn't already been reported. (+1 or comment...
**Describe the bug**
xp1d is not working properly!

**Configuration Information**
- NVIDIA device: 4090 x 24GB
- vLLM 0.9.0.1
- LMCache 0.3.1.dev12

**Test Command**
```
# p1
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/3rdparty/LMCache/examples/disagg_prefill/xp1d/configs/lmcache-prefiller-config.yaml \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=3...
```
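For context, an xp1d run also needs the decoder side launched alongside the prefiller. The sketch below is hypothetical: it simply mirrors the p1 command, assuming the example directory also contains a companion lmcache-decoder-config.yaml; the GPU index, model, and port are placeholders, not values from this report.

```
# 1d (decoder) -- hypothetical mirror of the p1 command above
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/3rdparty/LMCache/examples/disagg_prefill/xp1d/configs/lmcache-decoder-config.yaml \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=<decoder_gpu> \
vllm serve <model> --port <decoder_port>
```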
I am currently testing the RDMA write bandwidth of an InfiniBand NIC. I used two methods to test it.

The first:
```
# server
watch ib_write_bw -d mlx5_0 -q 1...
```
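For reference, `ib_write_bw` from the perftest suite runs as a client/server pair; a minimal sketch of the matching client invocation follows, where the message size, duration, and server address are placeholders rather than values from this report.

```
# client: connect to the server and run a timed RDMA write test,
# reporting bandwidth in Gb/s
ib_write_bw -d mlx5_0 -q 1 -s 65536 -D 10 --report_gbits <server_ip>
```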
We are currently testing the performance of large language models, running benchmarks at different concurrency levels/QPS, as follows:
```
============ Serving Benchmark Result ============
Backend:                    sglang
Traffic request rate:       inf...
```
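That header matches the output of sglang's serving benchmark script; a hedged sketch of the kind of invocation that produces it is shown below (the prompt count is a placeholder, and `--request-rate inf` submits every request immediately to measure peak concurrency).

```
# sweep request rates to benchmark different QPS levels;
# "inf" fires all requests at once
python3 -m sglang.bench_serving --backend sglang \
    --num-prompts 1000 --request-rate inf
```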
**Describe:**
Currently I'm getting the following error when using disk caching:
```
[2025-09-24 07:10:25,904] LMCache INFO: Reqid: chatcmpl-7b3868e5f218403781044c51450ded2c, Total tokens 2774, LMCache hit tokens: 2560, need to load: 2560 (vllm_v1_adapter.py:739:lmcache.integration.vllm.vllm_v1_adapter)...
```
**Problem description:**
I trained a RepVGG classification model with the Paddle framework, converted it to ONNX, and then tried to convert it to Caffe, but the Caffe conversion failed with an unsupported-op error.
By contrast, a ResNet18 classification model trained with the mmcls (torch) framework converts to ONNX and then to Caffe without problems.
Comparing the ONNX topology graphs of the two classification models, I found that both contain Gemm operations, as shown below:
(screenshots of the two ONNX graphs)
Since the Caffe framework is quite old, I would like to modify the Gemm operation in the paddle-repvgg model so that the Caffe conversion succeeds.
Are there any good suggestions or approaches? Thanks!
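One common workaround, sketched below under stated assumptions rather than as a verified fix, is to decompose each Gemm into MatMul + Add before running the Caffe converter, folding the `transB` attribute into the weight initializer. The filenames are placeholders, and it assumes the usual FC export pattern (`alpha = beta = 1`, `transA = 0`, bias present).

```
python - <<'PY'
# Decompose Gemm -> MatMul + Add so the converter never sees a Gemm node.
import onnx
from onnx import helper, numpy_helper

model = onnx.load("repvgg.onnx")          # placeholder path
graph = model.graph
inits = {t.name: t for t in graph.initializer}

new_nodes = []
for node in graph.node:
    if node.op_type != "Gemm":
        new_nodes.append(node)
        continue
    attrs = {a.name: helper.get_attribute_value(a) for a in node.attribute}
    a_in, w_in, b_in = node.input[0], node.input[1], node.input[2]
    # Fold transB into the weight initializer so a plain MatMul suffices.
    if attrs.get("transB", 0) and w_in in inits:
        w = numpy_helper.to_array(inits[w_in]).T.copy()
        inits[w_in].CopyFrom(numpy_helper.from_array(w, w_in))
    mm_out = node.output[0] + "_mm"
    new_nodes.append(helper.make_node("MatMul", [a_in, w_in], [mm_out]))
    new_nodes.append(helper.make_node("Add", [mm_out, b_in], [node.output[0]]))

del graph.node[:]
graph.node.extend(new_nodes)
onnx.save(model, "repvgg_no_gemm.onnx")   # placeholder path
PY
```

Note that some onnx2caffe tools support Gemm (mapping it to InnerProduct) but not MatMul, in which case the opposite direction, keeping Gemm but normalizing its attributes, is what's needed; check which ops your converter actually supports first.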
### Your current environment
```
Package                           Version       Editable project location
--------------------------------- ------------- -------------------------------------------------
accelerate                        1.12.0
aiofile                           3.9.0
aiofiles                          24.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.12.15
aiosignal                         1.4.0
annotated-types                   0.7.0
antlr4-python3-runtime            4.9.3
anyio...
```
### 🚀 The feature, motivation and pitch
Are there any plans to support CPU offloading for the KV cache? Currently, we've observed that the multimodal KV cache consumes significant resources. For example,...
### 🚀 The feature, motivation and pitch
Support inference with the cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit model.

### Alternatives
_No response_

### Additional context
_No response_

### Before submitting a new issue...
- [x]...