q yao

Results: 34 issues by q yao

Hi, I am a little confused by the ARF CUDA kernel. https://github.com/ZhouYanzhao/ORN/blob/d6b38aa5e5c3ca7c6e3d0ed5770e581ee1daadcd/src/orn/lib/active_rotating_filters.cu#L19-L33 Let's say thread 0 and thread 1 have: i0 == i1, j0 == j1, k0 == k1...
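For illustration, a minimal Python sketch of the collision the question is asking about; the index mapping below is hypothetical and is not the one in active_rotating_filters.cu:

```python
# Hypothetical sketch: if two distinct thread ids map to the same
# (i, j, k) tuple, both threads address the same output element.
# The mapping below is made up for illustration; it is NOT the actual
# indexing in active_rotating_filters.cu.
def thread_indices(tid, n_rotations=2):
    i = tid // 8
    j = (tid % 8) // n_rotations
    k = 0  # suppose the rotation dimension is folded away here
    return (i, j, k)

out = {}
for tid in (0, 1):
    idx = thread_indices(tid)  # both tids yield (0, 0, 0)
    # In the CUDA kernel, two unsynchronized writes to the same output
    # element would be a data race: the surviving value depends on
    # scheduling unless an atomic or a reduction is used.
    out[idx] = tid
print(out)  # {(0, 0, 0): 1} -- only the last write survives
```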

https://github.com/open-mmlab/mmcv/issues/2933#issuecomment-1758931803

- Triton 2.1.0 has the best performance.
- Parsing the signature (in 2.2.0 and 2.3.0) costs a lot.
- 2.3.0 does not accept `device` and `stream`.
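A minimal sketch for measuring that per-launch host overhead across Triton versions (assumes a CUDA GPU and an installed `triton`; the toy kernel is illustrative, not from lmdeploy):

```python
import time
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 1024
x, y = torch.randn(n, device="cuda"), torch.randn(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 256),)

add_kernel[grid](x, y, out, n, BLOCK=256)  # warm-up: first call compiles
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(1000):
    # each call pays the host-side dispatch cost (where 2.2.0 and 2.3.0
    # reportedly spend extra time parsing the signature)
    add_kernel[grid](x, y, out, n, BLOCK=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"{elapsed / 1000 * 1e6:.1f} us per launch")
```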

improvement

Implementation of https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407. I plan to refactor the implementation of s-lora so that we do not need to change the block size when enabling adapters. @zhyncs @ispobock
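A schematic of the idea under assumed names and layout (not lmdeploy's actual code): pad and chunk each adapter's weights into the same fixed-size pages the KV cache already uses, so enabling an adapter never forces a different block size.

```python
import torch

BLOCK_NUMEL = 64 * 128  # assumed fixed page size, shared with the KV cache

def pack_adapter(weight: torch.Tensor) -> list[torch.Tensor]:
    """Flatten a LoRA weight and split it into fixed-size pages.

    The tail page is zero-padded, so adapters of any rank fit the same
    page layout and the cache block size never needs to change.
    """
    flat = weight.reshape(-1)
    pad = (-flat.numel()) % BLOCK_NUMEL
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return list(flat.split(BLOCK_NUMEL))

# e.g. a rank-16 LoRA "A" matrix for a hidden size of 4096
pages = pack_adapter(torch.randn(16, 4096))
print(len(pages), pages[0].numel())  # 8 pages, each exactly BLOCK_NUMEL
```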

enhancement

Similar optimization to https://github.com/InternLM/lmdeploy/pull/1515 for deepseek-moe, qwen2-moe, and dbrx.

improvement


```bash
python3 \
    benchmark/profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    Mixtral-8x22B-v0.1 \
    --backend pytorch \
    --cache-max-entry-count 0.65 \
    --num-prompts 3000 \
    --concurrency 256 \
    --tp 4
```

```
--------------------------------------------------
concurrency: 256
elapsed_time: 736.060s
...
```

improvement

- Block size won't change after applying s-lora.
- Remove the slice op.

improvement

Enable by setting `shared_cache=True`.
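A hypothetical usage sketch; only the flag name `shared_cache=True` comes from this item, and whether it lives on `PytorchEngineConfig` is an assumption:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Assumption: `shared_cache` is a field of the PyTorch engine config.
# Only the flag name `shared_cache=True` is taken from the issue text.
backend_config = PytorchEngineConfig(shared_cache=True)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hello']))
```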

The kernel no longer recompiles when M changes. Performance on small batch sizes and short contexts is still not fast enough, since the Triton kernel launch takes too much time.
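A minimal Triton sketch of the no-recompile property (the kernel is a toy, not lmdeploy's): only `BLOCK_N` is compile-time, and M enters only through the launch grid, so varying M between calls reuses the same compiled binary.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_rows(x_ptr, out_ptr, N, BLOCK_N: tl.constexpr):
    # Only BLOCK_N is a compile-time constant. M never appears in the
    # kernel body, so changing it between launches cannot trigger a
    # recompile; each program handles one row.
    pid = tl.program_id(0)
    offs = tl.arange(0, BLOCK_N)
    mask = offs < N
    row = tl.load(x_ptr + pid * N + offs, mask=mask)
    tl.store(out_ptr + pid * N + offs, row * 2.0, mask=mask)

N, BLOCK_N = 128, 128
for M in (3, 7, 255):  # M varies; the same binary is reused each time
    x = torch.randn(M, N, device="cuda")
    out = torch.empty_like(x)
    scale_rows[(M,)](x, out, N, BLOCK_N=BLOCK_N)
```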

improvement