intervitens
This PR adds a custom floating-point quantization method powered by [TorchAO](https://github.com/pytorch/ao), which achieves high throughput thanks to the optimized [fp6_llm](https://github.com/usyd-fsalab/fp6_llm) kernel. Use `-q torchao --torchao-fp-bits 6` to load...
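For intuition about what a 6-bit float can represent, here is a minimal pure-Python sketch of FP6 round-tripping, assuming the E3M2 layout (1 sign bit, 3 exponent bits with bias 3, 2 mantissa bits) used by fp6_llm. The `decode_fp6`/`quantize_fp6` names are illustrative only, not part of the TorchAO API:

```python
def decode_fp6(code: int) -> float:
    """Decode a 6-bit E3M2 code (0..63) to its float value."""
    sign = -1.0 if (code >> 5) & 1 else 1.0
    exp = (code >> 2) & 0b111
    man = code & 0b11
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (man / 4) * 2 ** (1 - 3)
    # Normal: implicit leading 1, biased exponent.
    return sign * (1 + man / 4) * 2 ** (exp - 3)


def quantize_fp6(x: float) -> int:
    """Round x to the nearest representable E3M2 value (brute force
    over all 64 codes -- fine for illustration, not for speed)."""
    return min(range(64), key=lambda c: abs(decode_fp6(c) - x))
```

With this layout the largest representable magnitude is `decode_fp6(0b011111) == 28.0`, and any value beyond that saturates, e.g. `decode_fp6(quantize_fp6(100.0)) == 28.0`. The real kernel of course packs codes into tensors and fuses dequantization into the matmul; this sketch only shows the number format.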
#### Context

- [ ] add a new feature
- [x] fix a bug
- [ ] update tests and/or documentation
- [ ] other (please add here)

#2659 added...
This PR adds support for Qwen3 MoE (30B-A3B and 235B-A22B) models. Loss looked reasonable from a simple test with 30B-A3B on the Alpaca dataset.

TODO:
- [ ] Tensor/Expert parallel...