Eric Buehler
Refs:
- https://github.com/deepseek-ai/DeepSeek-V3/blob/4cc6253d5c225e2c5fea32c54573449c1c46470a/inference/model.py#L443
- https://github.com/sgl-project/sglang/pull/905/files#diff-5b9e34dd492bd8a14702a18b594721091092276fad1cf8736fba6ef1f33c1b04
- https://github.com/InternLM/lmdeploy/pull/1621/files#diff-daef4154c2a77eba9f2e444df958cc19b318ce248c09995080b344b174522dc5
- [x] Config
- [ ] Qwen2_5OmniThinkerForConditionalGeneration (text + image + audio **in**, text **out**)
- [ ] Qwen2_5OmniAudioEncoder
- [ ] Qwen2_5OmniVisionEncoder
- [x] Qwen2_5OmniThinkerTextModel
- [ ] Qwen2_5OmniTalkerForConditionalGeneration...
At its core, this is a ring-based all-reduce backend. It enables tensor parallelism for Metal users!
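To illustrate the idea, here is a minimal single-process sketch of a ring all-reduce (sum): a reduce-scatter pass followed by an all-gather pass, each taking N-1 steps around the ring. The function name and list-of-lists layout are hypothetical; a real backend would exchange GPU buffers between devices rather than Python lists.

```python
def ring_all_reduce(chunks_per_rank):
    """Simulate ring all-reduce (sum) across N ranks.

    Each rank holds a vector split into N chunks. First, N-1 reduce-scatter
    steps: each rank sends one chunk to its right neighbor, which accumulates
    it. Then N-1 all-gather steps circulate the fully reduced chunks.
    """
    n = len(chunks_per_rank)  # world size
    # Work on copies so callers' data is untouched.
    data = [[list(c) for c in rank] for rank in chunks_per_rank]

    # Reduce-scatter: at step t, rank r sends chunk (r - t) mod n.
    # Afterwards, rank r holds the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        for src in range(n):
            chunk_idx = (src - step) % n
            dst = (src + 1) % n
            for i, v in enumerate(data[src][chunk_idx]):
                data[dst][chunk_idx][i] += v

    # All-gather: at step t, rank r forwards chunk (r + 1 - t) mod n,
    # so each reduced chunk travels once around the ring.
    for step in range(n - 1):
        for src in range(n):
            chunk_idx = (src + 1 - step) % n
            dst = (src + 1) % n
            data[dst][chunk_idx] = list(data[src][chunk_idx])

    return data


ranks = [[[1], [2], [3]], [[10], [20], [30]], [[100], [200], [300]]]
print(ring_all_reduce(ranks)[0])  # each rank ends with [[111], [222], [333]]
```

Each rank only ever talks to its neighbor, so per-step traffic is constant in N, which is why the ring layout suits bandwidth-limited interconnects.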
Removing `contiguous` calls in RMSNorm and RoPE might help.
Support FlashMLA for improved throughput with MLA models (DeepSeek V2, V3/R1) on CUDA.
- https://github.com/EricLBuehler/candle/pull/74
- https://github.com/deepseek-ai/FlashMLA
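For context, the core MLA trick that FlashMLA accelerates is caching one small shared latent per token instead of full per-head K/V, then reconstructing keys and values via up-projections. A rough numpy sketch of just that compression step (all dimension values and weight names here are made-up toy choices, and DeepSeek's decoupled RoPE path is omitted):

```python
import numpy as np

# Toy sizes (hypothetical; real DeepSeek configs are far larger).
seq, d_model, d_latent, n_heads, d_head = 5, 32, 8, 4, 16
rng = np.random.default_rng(0)

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.1           # KV down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1   # key up-projection
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1   # value up-projection

h = rng.standard_normal((seq, d_model))  # hidden states for seq tokens

# MLA caches only this compressed latent per token ...
c_kv = h @ W_dkv                                   # (seq, d_latent)

# ... and reconstructs per-head keys/values on the fly at attention time.
k = (c_kv @ W_uk).reshape(seq, n_heads, d_head)    # (seq, n_heads, d_head)
v = (c_kv @ W_uv).reshape(seq, n_heads, d_head)

# Cache footprint per token: d_latent floats instead of 2 * n_heads * d_head.
print(d_latent, "vs", 2 * n_heads * d_head)
```

The shrunken cache is what makes MLA decoding memory-bound in a different way than standard MHA, and kernels like FlashMLA exploit that layout directly.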
- Add the whisper model