q yao
Hi, I am a little bit confused by the ARF CUDA kernel. https://github.com/ZhouYanzhao/ORN/blob/d6b38aa5e5c3ca7c6e3d0ed5770e581ee1daadcd/src/orn/lib/active_rotating_filters.cu#L19-L33 Let's say thread 0 and thread 1 have: i0 == i1, j0 == j1, k0 == k1...
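For concreteness, here is a NumPy analogue of the collision I am asking about (illustration only, not the ORN code): with duplicated output indices, a plain scatter-accumulate loses updates, just as non-atomic stores from two CUDA threads with equal i/j/k would; `np.add.at` plays the role of `atomicAdd`.

```python
import numpy as np

src = np.ones(4, dtype=np.float32)
idx = np.array([0, 0, 1, 1])  # "threads" 0/1 and 2/3 collide on purpose

# Racy analogue: with duplicate indices, only one update per index survives,
# like two CUDA threads doing a plain read-modify-write on the same element.
dst = np.zeros(2, dtype=np.float32)
dst[idx] += src
print(dst)  # [1. 1.] -- one contribution per index was lost

# Atomic analogue: np.add.at applies every contribution, like atomicAdd.
dst = np.zeros(2, dtype=np.float32)
np.add.at(dst, idx, src)
print(dst)  # [2. 2.]
```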
https://github.com/open-mmlab/mmcv/issues/2933#issuecomment-1758931803
- Triton 2.1.0 has the best performance.
- Signature parsing (in 2.2.0 and 2.3.0) costs a lot.
- 2.3.0 does not accept `device` and `stream` (see the launch sketch below).
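A minimal launch sketch for the last point: a generic vector-add kernel (not lmdeploy code), launched under a chosen stream via PyTorch's stream context, which the Triton launcher is assumed here to pick up as the current stream instead of an explicit `stream=` kwarg.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):  # launch lands on this stream
    grid = (triton.cdiv(x.numel(), 256),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=256)
torch.cuda.synchronize()
```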
Implementation of https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2044203407. I plan to refactor the S-LoRA implementation so we do not need to change the block size when enabling adapters. @zhyncs @ispobock
A similar optimization to https://github.com/InternLM/lmdeploy/pull/1515, for deepseek-moe, qwen2-moe, and dbrx.
```bash
python3 \
    benchmark/profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    Mixtral-8x22B-v0.1 \
    --backend pytorch \
    --cache-max-entry-count 0.65 \
    --num-prompts 3000 \
    --concurrency 256 \
    --tp 4
```

```
--------------------------------------------------
concurrency: 256
elapsed_time: 736.060s...
```
Enable it by setting `shared_cache=True`.
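A hypothetical usage sketch — where `shared_cache` is exposed is an assumption here; I show it as a `PytorchEngineConfig` field, which may not match the final surface:

```python
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    "internlm/internlm2-chat-7b",  # placeholder model path
    backend_config=PytorchEngineConfig(
        shared_cache=True,  # assumed location of this PR's flag
    ),
)
```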
The kernel will not recompile when M changes. Performance at small batch sizes and short context lengths is still not fast enough, since the Triton kernel launch takes too much time.
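A hedged sketch (generic kernel, not this PR's code) of the no-recompile behavior: M is passed as a plain runtime integer and handled with a mask, rather than as a `tl.constexpr` that would specialize and recompile per value. Note Triton may still keep a few variants for integer specializations such as divisibility by 16.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def scale_kernel(x_ptr, out_ptr, scale, M, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < M  # runtime bound: no per-M specialization
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * scale, mask=mask)


for M in (3, 100, 257):  # varying M reuses the compiled binary
    x = torch.randn(M, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(M, 128),)
    scale_kernel[grid](x, out, 2.0, M, BLOCK=128)
```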