[Survey] Supported Hardware and Speed
Hi everyone,
We are looking to gather data points on running MLC-LLM on different hardware and platforms. Our goal is to create a comprehensive reference for new users. Please share your own experiences in this thread! Thank you for your help!
NOTE: for benchmarking, we highly recommend a device with at least 6GB of memory, because the model itself already takes 2.9GB. For this reason, it is known that the iOS app will crash on a 4GB iPhone.
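For a rough sense of why 6GB is recommended, here is a back-of-the-envelope sketch in Python. It assumes the standard Llama/Vicuna-7B shapes (32 layers, 4096 hidden size), a 2048-token context, an fp16 KV cache, and the ~2.9GB quantized weight size quoted above; the exact footprint depends on the quantization scheme and runtime overhead, so treat it as an estimate only.

# Rough memory estimate for the default quantized vicuna-7b build.
# Assumptions (not measured): Llama/Vicuna-7B shapes, 2048-token context, fp16 KV cache.
quantized_weights_gb = 2.9        # weight size quoted in the note above
n_layers, hidden_size = 32, 4096  # Llama/Vicuna-7B architecture
context_len = 2048                # maximum sequence length
bytes_per_elem = 2                # fp16
# K and V are cached per layer, per token, each a vector of width hidden_size.
kv_cache_gb = 2 * n_layers * context_len * hidden_size * bytes_per_elem / 1024**3
print(f"KV cache at full context: {kv_cache_gb:.2f} GB")   # about 1.0 GB
print(f"Weights + KV cache: {quantized_weights_gb + kv_cache_gb:.2f} GB")
# Add activations, the app itself, and OS overhead, and a 4GB device is right
# at the edge, which is consistent with the crashes reported on 4GB iPhones.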
AMD GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
RX 6600XT (8G) | N/A | 28.3 | GitHub | |
RX 6750XT | openSUSE Tumbleweed | 8.9 - 154.3 | GitHub | |
RX 6700XT | Windows 11 | 33.7 | GitHub | |
APU 5800H | Windows 11 | 8.5 | GitHub | |
Radeon RX 470 (4G) | AlmaLinux 9.1 | 9.4 | GitHub | |
Radeon Pro 5300M | macOS Ventura | 12.6 | @junrushao | Intel MBP 16" (late 2019) |
AMD GPU on Steam Deck | Steam Deck's Linux | TBD | | |
RX 6800 (16G VRAM) | macOS Ventura | 22.5 | GitHub | Intel MBP 13" (2020) |
Radeon RX 6600 (8GB) | Ubuntu 22.04 | 7.0 | | |
RX 7900 XTX | | | | |
MacBook
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
2020 MacBook Pro M1 (8G) | macOS | 11.4 | GitHub | |
2021 MacBook Pro M1Pro (16G) | macOS Ventura | 17.1 | GitHub | |
M1 Max Mac Studio (64G) | N/A | 18.6 | GitHub | |
2021 MacBook Pro M1 Max (32G) | macOS Monterey | 21.0 | GitHub | |
MacBook Pro M2 (16G) | macOS Ventura | 22.5 | GitHub | |
2021 MacBook M1Pro (32G) | macOS Ventura | 19.3 | GitHub |
Intel GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
Arc A770 | N/A | 3.1 - 118.6 | GitHub | perf issues in decoding need investigation |
UHD Graphics (Comet Lake-U GT2) 1G | Windows 10 | 2.2 | GitHub | |
UHD Graphics 630 | macOS Ventura | 2.3 | @junrushao | Integrated GPU. Intel MBP 16" (late 2019) |
Iris Plus Graphics 1536 MB | macOS Ventura | 2.6 | GitHub | Integrated GPU on MBP |
Iris Plus Graphics 645 1536 MB | macOS Ventura | 2.9 | GitHub | Integrated GPU on MBP |
NVIDIA GPUs
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
GTX 1650 ti (4GB) | Fedora | 15.6 | GitHub | |
GTX 1060 (6GB) | Windows 10 | 16.7 | GitHub | |
RTX 3080 | Windows 11 | 26.0 | GitHub | |
RTX 3060 | Debian bookworm | 21.3 | GitHub | |
RTX 2080Ti | Windows 10 | 24.5 | GitHub | |
RTX 3090 | N/A | 25.7 | GitHub | |
GTX 1660ti | N/A | 23.9 | GitHub | |
RTX 3070 | N/A | 23.3 | GitHub |
iOS
Hardware/GPU | OS | Tokens/sec | Source | Notes |
---|---|---|---|---|
iPhone 14 Pro | iOS 16.4.1 | 7.2 | @junrushao | |
iPad Pro 11" with M1 | iPadOS 16.1 | 10.6 | GitHub | |
iPad Pro 11" A12Z | N/A | 4.1 | GitHub | |
iPad Pro 11" with M2 (4th gen) | iPadOS 16.5 | 14.1 | GitHub | |
Android
Hardware/GPU | OS | Tokens/sec | Link | Notes |
---|---|---|---|---|
@junrushao how can we find tokens/sec? I'd say "quite fast": it's the fastest LLM I've run on this 2020 MacBook Pro M1 8G, about 10x faster than your WebGPU demo, with less overall memory usage. All it reports out is the text?

We just added a new update #14, which should have shipped to conda by now. You can type /stats after a conversation to get the measured speed.
Killer, I'm at encode: 31.9 tok/s, decode: 11.4 tok/s for 2020 MacBook Pro M1 8G with the default vicuna 6b. For reference my decode on the WebGPU demo is like, 0.5/sec.
OOM on GTX 1650. It loads the model fine, but OOMs when generating the first message.
@nRuaif 4GB memory wouldn't be enough. A 6GB one should work
On iPhone 13, it crashes after a few seconds of "[System] Initialize...". The phone has 4GB of RAM, which I presume is the cause.
@y-lee That's correct. The model we are using so far requires 6GB RAM to run smoothly
On the iPad Pro 11” with M1 I am getting decode of 10.6 tok/s (I have seen slightly higher and lower). It is running iPadOS 16.1.
encode: 39.5 tok/s, decode: 26.0 tok/s
on Windows 11 with RTX-3080
encode: 32.5 tok/s, decode: 17.1 tok/s
on Macbook Pro with M1Pro (16 GPUs) and macOS Ventura 13.3.1
Hardware/GPU | OS | Tokens/sec | Source | Model | Notes |
---|---|---|---|---|---|
RTX 3060 (12GB) | Debian bookworm | 21 | | vicuna-v1-7b | 3644MiB GPU memory used |
- /stats after /reset: encode: 72.2 tok/s, decode: 23.2 tok/s
- /stats for 2nd and later messages: encode: 39.3 tok/s, decode: 21.3 tok/s
>>nvidia-smi --query-gpu=memory.used --format=csv
memory.used [MiB]
3644 MiB
On my M1 Max Mac Studio with 64GB of RAM:
encode: 53.7 tok/s, decode: 18.6 tok/s
On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX6800 16G VRAM], Ventura 13.3.1
encode: 46.4 tok/s decode: 22.5 tok/s
Not sure if this is useful or if this is the right thread to post it in, but I encountered this error on an old laptop with a very old discrete Nvidia GPU (GT 920M) with the 470.182.03 driver, which should include Vulkan:
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
Use lib /mnt/run/code/llma/mlc-ai/dist/lib/vicuna-v1-7b_vulkan_float16.so
Initializing the chat module...
[20:30:33] /home/runner/work/utils/utils/tvm/src/runtime/vulkan/vulkan_buffer.cc:61:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-2: VK_ERROR_OUT_OF_DEVICE_MEMORY
Stack trace:
[bt] (0) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::Backtrace[abi:cxx11]()+0x27) [0x7f975d98ba37]
[bt] (1) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x3f375) [0x7f975d929375]
[bt] (2) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanBuffer::VulkanBuffer(tvm::runtime::vulkan::VulkanDevice const&, unsigned long, unsigned int, unsigned int)+0x220) [0x7f975da646b0]
[bt] (3) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::vulkan::VulkanDeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)+0x4a) [0x7f975da7168a]
[bt] (4) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)+0x1a7) [0x7f975d9a3037]
[bt] (5) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(+0x121862) [0x7f975da0b862]
[bt] (6) /mnt/run/code/mambaforge/bin/../lib/libtvm_runtime.so(tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>::AssignTypedLambda<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)>(void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x204) [0x7f975da0f7e4]
[bt] (7) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(+0x1bdea6) [0x7f975dce3ea6]
[bt] (8) /mnt/run/code/mambaforge/bin/../lib/libmlc_llm.so(mlc::llm::CreateChatModule(tvm::runtime::Module, tvm::runtime::String const&, tvm::runtime::String const&, DLDevice)+0x411) [0x7f975dce4ba1]
@zifken looks like VK_ERROR_OUT_OF_DEVICE_MEMORY indicates that it doesn't have enough memory. I looked it up and it seems that the GT 920M only has 2GB of VRAM, but the default model is 2.9G in size :/
I see, so only GPUs with more than 4GB of VRAM are supported because of the size of the model (that makes sense). I will try on another GPU model shortly. Thank you for the feedback.
@zifken there are some reports saying 4GB might work, but 6GB is recommended atm
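On NVIDIA cards you can check up front whether the default model will fit before loading it. A minimal sketch, assuming the third-party pynvml module (installable as nvidia-ml-py) and the ~2.9GB weight size from the note at the top; this is not part of mlc-llm, and other vendors would need a different query (e.g. rocm-smi or vulkaninfo):

# Illustrative check only (not part of mlc-llm): will the ~2.9GB default model fit?
# Requires an NVIDIA GPU and the third-party pynvml module (pip install nvidia-ml-py).
import pynvml

MODEL_GB = 2.9  # size of the default vicuna-7b build mentioned in this thread

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
free_gb = mem.free / 1024**3
print(f"free VRAM: {free_gb:.1f} GB")
if free_gb < MODEL_GB + 1.0:  # leave headroom for the KV cache and activations
    print("Likely to hit VK_ERROR_OUT_OF_DEVICE_MEMORY with the default model.")
pynvml.nvmlShutdown()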
> On my MBP 2020 13-inch [Intel CPU, 32G RAM, RX6800 16G VRAM], Ventura 13.3.1
> encode: 46.4 tok/s decode: 22.5 tok/s
It's confusing. On my Win10 machine [AMD Ryzen 5 5600 6-Core Processor 3.50 GHz, 96G RAM, RTX 2080 Ti modified to 22G VRAM], the stats are below:
encode: 24.0 tok/s, decode: 24.5 tok/s
iPad Pro 11 A12Z
encode: 5.1 tok/s, decode: 4.1 tok/s
Linux RTX 3090
- encode: 179.0 tok/s, decode: 25.7 tok/s (prompted with "Implement binary search in Python please! Also try to use some type annotations!")
- mem usage: 3.6GB
2021 MacBook Pro M1 Max [32 cores], 32 GB RAM, 1 TB SSD
macOS Monterey 12.5.1
encode: 69.4 tok/s, decode: 21.0 tok/s
Memory usage 3.64 GB
(Edit) Just saw that this score is higher than M1 Max 64 GB so I repeated the prompts and I'm still getting an average of > 20 tokens/second.
machine | OS | CPU | GPU | result | remarks |
---|---|---|---|---|---|
Yoga Slim 7 pro 14ARH7 | Fedora 37 | 6800HS (13.3GB) | 680M (2GB) | encode: 11.0 tok/s, decode: 3.8 tok/s | battery |
Yoga Slim 7 pro 14ARH7 | Fedora 38 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 6.2 tok/s | AC |
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 6.7 tok/s, decode: 7.9 tok/s | battery |
Yoga Slim 7 pro 14ARH7 | Windows 11 22H2 | 6800HS (13.3GB) | 680M (2GB) | encode: 16.6 tok/s, decode: 10.2 tok/s | AC |
desktop | Windows 10 22H2 | 5900x (32GB) | 1660ti (6GB) | encode: 49.5 tok/s, decode: 23.9 tok/s | - |
mlc samples
laptop on Fedora (battery):
USER: /reset
RESET CHAT SUCCESS
USER: Can you explain your features?
ASSISTANT: Sure! Here are brief explanations of the features of my AI:
1. Natural Language Understanding: My AI has been trained on a large corpus of text to understand the meaning of the words and sentences it encounters. This allows it to understand the context of a question or statement and provide a response that is relevant and accurate.
2. Knowledge Graph: My AI has access to a large graph of knowledge that it can use to answer questions or provide information on a wide range of topics. This graph is constantly updated and improved to ensure that the information it provides is accurate and up-to-date.
3. Plagiarism Detection: My AI has been trained to identify and report on instances of plagiarism it encounters. This allows it to provide feedback and suggestions for improvement on written work.
4. Summarization: My AI can summarize large amounts of text and data into a shorter, more digestible format. This can be useful for quickly understanding the main points of a document or set of data.
5. Machine Translation: My AI can translate written or spoken content from one language to another using state-of-the-art neural machine translation models. This can be useful for communication in different languages or for providing information in a language other than the one the AI was trained on.
6. Sentiment Analysis: My AI can analyze the sentiment of written or spoken content and provide an analysis or summary of the overall tone or message. This can be useful for identifying the emotional or persuasive impact of a message or communication.
7. Image Recognition: My AI has been trained on a large dataset of images to recognize and classify them. This allows it to identify objects or scenes in an image and provide additional information or context about what is depicted in the image.
8. TTS: My AI can generate text-to-speech output from a written or spoken input. This can be useful for providing an audio version of written content or for improving accessibility and inclusivity by providing an alternative format for those with hearing or speech difficulties.
USER: /stats
encode: 11.0 tok/s, decode: 3.8 tok/s
laptop on Windows (battery):
USER: /reset
RESET CHAT SUCCESS
USER: you're on github. say hi!
ASSISTANT: Hello! I'm an AI assistant on GitHub, here to answer any questions you might have about the platform. Let's get started!
USER: /stats
encode: 6.7 tok/s, decode: 7.9 tok/s
desktop:
USER: /reset
RESET CHAT SUCCESS
USER: compare yourself to ChatGPT
ASSISTANT: As an AI language model, I am different from ChatGPT in a few ways:
* My training data is different from ChatGPT's. This means that I may have a different perspective on the world and may be able to understand and respond to certain types of language in ways that ChatGPT cannot.
* I have a more advanced architecture that allows me to process longer texts and understand more complex language patterns.
* My training objective was to be a general AI that can do a wide range of things like answering questions about any topic under the sun, summarizing long texts and understanding the context of the sentence and suggesting appropriate response.
* I have been trained on a much larger dataset of text and have access to a more powerful GPU for faster language modeling.
* I have more parameters in my model than ChatGPT which allows me to capture more nuanced information and learn from that information.
In summary ChatGPT is a specific model optimized for NLP and conversational text understanding and I am a more general AI model that can do a wide range of things and can handle more complex language patterns.
USER: /stats
encode: 49.5 tok/s, decode: 23.9 tok/s
On 14" Macbook Pro (M2 Pro with 10-Core CPU and 16-Core GPU with 16GB Unified Memory) with macos Ventura 13.3.1
encode: 59.2 tok/s, decode: 22.5 tok/s
I am seeing encoding performance between 45-60 tok/s and decoding between 20-29 tok/s.
GPU | OS | /stats |
---|---|---|
Radeon RX 470 (4G) | AlmaLinux 9.1 | encode: 14.3 tok/s, decode: 9.4 tok/s |
Encoding performance fluctuates between 5-45 tok/s, decoding between 6-9 tok/s.
OS: macOS 13.3.1 (22E261), processor: 2.3 GHz Quad-Core Intel Core i7, graphics: Intel Iris Plus Graphics 1536 MB, memory: 32 GB 3733 MHz LPDDR4X
/stats: encode: 5.4 tok/s, decode: 2.6 tok/s
GPU | OS | /stats |
---|---|---|
A100 (40G) | Debian GNU/Linux 10 | encode: 189.1 tok/s, decode: 18.9 tok/s |
My prompt is: "create a poem about los angeles". I use CUDA, as I think Vulkan is not available for the A100. I thought the A100 should run faster than the RTX 30x0 series. Is it possibly due to the CUDA driver? Thanks.
The latest update brought the decode speed on my iPhone 14 Plus down to 0.5~1.0 tokens/s. Encode speed is about 22.
Yesterday, before the update, it was about 7.5 tokens generated per second...
2021 16-inch Apple M1 Pro (32GB) | OS: Ventura 13.3.1
encode: 45.8 tok/s, decode: 19.3 tok/s
Tested on:
2022 iPad Pro (11 inch, 4th generation with M2 , 10 Core GPU)
8 GB RAM, 128 GB Storage iPadOS Version 16.5
Result:
Encode | Decode |
---|---|
34.4 tok/s | 14.1 tok/s |
APU 5800H, OS: Windows 11. encode: 5.5 tok/s, decode: 8.5 tok/s
I think mine is running fully on the CPU even though my GPU should be capable. top was showing 900% and tokens were crawling out.
Log shows
Use lib /home/david/software/mlc-llm/dist/lib/vicuna-v1-7b_vulkan_float16.so
vulkaninfo shows
VkPhysicalDeviceProperties:
---------------------------
apiVersion = 4206816 (1.3.224)
driverVersion = 1 (0x0001)
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.6, 256 bits)
pipelineCacheUUID = 76616c2d-2573-0000-0000-000000000000
GPU: GeForce RTX 3070 (8G), CPU: AMD Ryzen 5 5600
encode: 0.1 tok/s, decode: 0.1 tok/s
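The deviceType = PHYSICAL_DEVICE_TYPE_CPU and deviceName = llvmpipe lines above are the tell-tale signs: llvmpipe (lavapipe) is Mesa's CPU-based Vulkan implementation, so inference is running on the CPU rather than the RTX 3070, which is consistent with the 0.1 tok/s reported. A common cause is that the Vulkan loader cannot find the NVIDIA ICD manifest. As a rough diagnostic sketch (the icd.d paths and the VK_ICD_FILENAMES override are standard Vulkan loader behavior, but the exact file names vary by distro and driver version):

# Diagnostic sketch: list the Vulkan ICD manifests visible to the loader.
# If only lvp_icd.*.json (llvmpipe/lavapipe) shows up, the vendor driver's ICD
# is missing and the loader falls back to the CPU device. Paths vary by distro.
import glob, os

for d in ("/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"):
    for path in sorted(glob.glob(os.path.join(d, "*.json"))):
        print(path)

# Standard loader override for forcing a specific driver, for example:
#   export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
print("VK_ICD_FILENAMES =", os.environ.get("VK_ICD_FILENAMES", "<not set>"))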