
CPU offloading

Open LiliumSancta opened this issue 2 years ago • 6 comments

Incredible project! I managed to run the model with good speed on my AMD hardware, thanks. One question: do you have any plans to support offloading the weights so that bigger models like 13B or 30B can run with less VRAM?
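For rough context (back-of-the-envelope numbers, weights only, ignoring the KV cache and activations), this is why 13B and 30B models don't fit on a typical consumer card at fp16 and why offloading or heavier quantization is needed:

```python
# Back-of-the-envelope weight memory (weights only; KV cache and
# activations come on top of this).
def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for n in (7, 13, 30):
    print(f"{n}B: ~{weight_gb(n, 16):.1f} GB at fp16, ~{weight_gb(n, 4):.1f} GB at 4-bit")
# 7B:  ~13.0 GB at fp16, ~3.3 GB at 4-bit
# 13B: ~24.2 GB at fp16, ~6.1 GB at 4-bit
# 30B: ~55.9 GB at fp16, ~14.0 GB at 4-bit
```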

LiliumSancta avatar Apr 29 '23 13:04 LiliumSancta

Hey, thanks for your interest! Our backend (TVM Unity) supports AMD CPUs out of the box, so it wouldn't be too challenging (likely tens of lines) to introduce support for them. Not too sure about the latency side of things, though.
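As a rough, non-authoritative illustration of what targeting a CPU looks like at the TVM level (the `znver3` value is just an example for a Zen 3 part; the actual wiring inside MLC would look different):

```python
import tvm

# An LLVM (CPU) target, here for an example Zen 3 AMD CPU. The same
# compilation pipeline that emits Vulkan/CUDA kernels can emit native
# CPU code when handed a target like this.
cpu_target = tvm.target.Target("llvm -mcpu=znver3")
print(cpu_target.kind.name)  # "llvm"
```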

junrushao avatar Apr 29 '23 13:04 junrushao

Yeah, the limitation of current LLM inference programs like Oobabooga WebUI and KoboldAI is that CPU offloading is very slow. Maybe the MLC team could build a very fast CPU offloader that allocates RAM on the fly as soon as VRAM overflows, preventing out-of-memory errors at high context sizes and with big models while still staying relatively fast.
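To make the idea concrete, here is a minimal, purely hypothetical sketch (plain Python, no real MLC/TVM APIs) of the placement decision such an offloader would make: keep as many layers on the GPU as fit within the VRAM budget minus a KV-cache reserve, and spill the rest to host RAM:

```python
from dataclasses import dataclass

@dataclass
class Placement:
    gpu_layers: list[int]  # layer indices kept in VRAM
    cpu_layers: list[int]  # layer indices spilled to host RAM

def plan_offload(layer_bytes: list[int], vram_budget: int, kv_cache_reserve: int) -> Placement:
    """Greedy placement: fill VRAM (minus a KV-cache reserve) front to back,
    then spill the remaining layers to host RAM."""
    available = vram_budget - kv_cache_reserve
    gpu, cpu, used = [], [], 0
    for i, size in enumerate(layer_bytes):
        if used + size <= available:
            gpu.append(i)
            used += size
        else:
            cpu.append(i)
    return Placement(gpu, cpu)

# Example: 40 layers of ~800 MB each, a 24 GB card, 6 GB reserved for KV cache.
layers = [800 * 2**20] * 40
plan = plan_offload(layers, vram_budget=24 * 2**30, kv_cache_reserve=6 * 2**30)
print(len(plan.gpu_layers), "layers on GPU,", len(plan.cpu_layers), "offloaded to CPU")
```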

Dampfinchen avatar Apr 29 '23 20:04 Dampfinchen

The dream (for my hardware) is being able to split the model between separate Vulkan devices, maybe splitting up layers like llama.cpp does (see the sketch below). This would allow hybrid IGP+GPU inference or multi-GPU splitting.

Splitting the model between multiple backends is probably outside the domain of TVM, though, right?
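Purely as a hypothetical sketch (again, no real TVM/MLC API involved): llama.cpp-style splitting boils down to assigning contiguous layer ranges to devices, roughly in proportion to each device's memory, and handing the activations across the boundary between ranges:

```python
def split_layers(n_layers: int, device_mem_bytes: list[int]) -> list[range]:
    """Assign contiguous layer ranges to devices, proportional to their memory."""
    total = sum(device_mem_bytes)
    ranges, start = [], 0
    for i, mem in enumerate(device_mem_bytes):
        if i == len(device_mem_bytes) - 1:
            end = n_layers  # last device takes whatever remains
        else:
            end = start + round(n_layers * mem / total)
        ranges.append(range(start, end))
        start = end
    return ranges

# Example: 40 layers split across an 8 GB dGPU and a 2 GB IGP.
print(split_layers(40, [8 * 2**30, 2 * 2**30]))
# -> [range(0, 32), range(32, 40)]
```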

AlphaAtlas avatar Jun 08 '23 22:06 AlphaAtlas

As someone who currently does CPU inference, I would love this feature.

sirus20x6 avatar Sep 21 '23 14:09 sirus20x6

Any updates?

MikeLP avatar Feb 03 '24 04:02 MikeLP