Junru Shao
Quantization plays an important role in reducing memory usage if you want to run a larger model on consumer-grade GPUs, so please turn it on :-)
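For a rough sense of the saving, here is a back-of-the-envelope sketch; the 7B parameter count and bit widths are illustrative assumptions, and activations/KV cache are ignored:

```python
# Back-of-the-envelope estimate of weight memory for a ~7B-parameter model.
# The figures are illustrative assumptions, not measurements; activation
# and KV-cache memory are ignored.
PARAMS = 7_000_000_000

def weight_memory_gib(bits_per_param: int) -> float:
    """GiB needed to hold the weights alone at the given precision."""
    return PARAMS * bits_per_param / 8 / 1024**3

print(f"fp16 weights : {weight_memory_gib(16):.1f} GiB")  # ~13.0 GiB
print(f"4-bit weights: {weight_memory_gib(4):.1f} GiB")   # ~3.3 GiB
```

That is roughly the difference between not fitting and fitting comfortably on an 8 GB consumer card.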
I believe Vulkan is supported according to #15. On the other hand, the A100 is an extremely powerful GPU, so why not simply run Hugging Face's PyTorch models directly?
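On an A100, the plain Hugging Face route is only a few lines. A minimal sketch with the `transformers` API; the model id below is a placeholder, and `device_map="auto"` assumes the `accelerate` package is installed:

```python
# Minimal sketch: run a Hugging Face PyTorch checkpoint directly on a
# large GPU such as an A100. The model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"  # placeholder; substitute your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 fits easily in 40/80 GB of HBM
    device_map="auto",          # requires the `accelerate` package
)

inputs = tokenizer("Hello! How are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```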
The technical path we are taking is quite different from Llama.cpp's. MLC LLM primarily uses a compiler to generate efficient code targeting multiple CPU/GPU vendors, while Llama.cpp focuses on handcrafting...
We will have tutorials on making use of the MLC LLM APIs in Python/JavaScript/Java/Swift, etc.
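To give a flavor of what the Python side could look like, here is a purely hypothetical sketch; the package, class, and parameter names (`mlc_chat`, `ChatModule`, the model id format) are my assumptions, and the actual tutorials may differ:

```python
# Hypothetical sketch of a Python chat API; the names below are
# assumptions, not the documented MLC LLM interface.
from mlc_chat import ChatModule  # assumed package/module name

cm = ChatModule(model="vicuna-v1-7b-q4f16_1")  # assumed model id format
print(cm.generate(prompt="What is the capital of France?"))
```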
I actually think this interface is a nice wrapper on top of `mlc_chat_cli`.
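In the same spirit, a thin wrapper can be built with nothing but the standard library. A minimal sketch; the `--model` flag and model name are assumptions, so check `mlc_chat_cli --help` for the real interface:

```python
# Minimal sketch of driving `mlc_chat_cli` from Python via subprocess.
# The flag name and model name are assumptions.
import subprocess

def chat_cli(model: str, extra_args: list[str] | None = None) -> None:
    """Launch an interactive mlc_chat_cli session for the given model."""
    cmd = ["mlc_chat_cli", "--model", model, *(extra_args or [])]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    chat_cli("vicuna-v1-7b")  # placeholder model name
```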
Yep, please use this repo: https://github.com/mlc-ai/relax
Windows should work; it does in my experiments.
To be clear, TVM Unity has both ROCm and Vulkan backends, which means we do not necessarily have to depend on ROCm the way Triton does. At the moment, I believe Vicuna-7b...
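The multi-backend point is just standard TVM usage. A small sketch of constructing both targets, assuming a TVM build with these backends enabled:

```python
# Sketch: TVM exposes ROCm and Vulkan as ordinary compilation targets,
# so the same model can be lowered to either. Assumes a TVM build with
# both backends enabled.
import tvm

rocm_target = tvm.target.Target("rocm")
vulkan_target = tvm.target.Target("vulkan")

print(rocm_target.kind.name)    # -> "rocm"
print(vulkan_target.kind.name)  # -> "vulkan"
```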
Please use `mlc_chat_cli` instead
Closing as the issue seems resolved :-)