Smart choice of the inference algorithm
`generate/base.py` and `generate/chat.py` (which uses the former) assume that the model fits in memory. There are also `generate/sequentially.py` and `generate/tp.py`, which support using multiple devices.

To streamline the experience, we could have the `chat` or `generate` entrypoints choose an implementation based on the model config (size, sequence length) and the available hardware (memory, number of devices). This would require #921.
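Roughly, the automatic choice could look something like the sketch below. The `choose_implementation` helper, the memory estimate, and the thresholds are all made up for illustration and are not existing litgpt code:

```python
import torch

# Hypothetical dispatch sketch: pick a generate implementation from the model
# config and the free memory on the current CUDA device. Assumes CUDA is
# available; config.n_layer / config.n_embd are the usual litgpt Config fields.
def choose_implementation(config, devices: int) -> str:
    # Rough parameter-count estimate for a decoder-only transformer
    # (~12 * n_layer * n_embd^2), times 2 bytes per weight for fp16/bf16.
    # A real check would also budget for the KV cache, which grows with
    # sequence length.
    approx_params = 12 * config.n_layer * config.n_embd ** 2
    weight_bytes = 2 * approx_params

    free_bytes, _ = torch.cuda.mem_get_info()
    if weight_bytes < free_bytes:
        return "base"           # generate/base.py: fastest when it fits on one device
    if devices > 1 and weight_bytes < devices * free_bytes:
        return "sequentially"   # generate/sequentially.py: spread layers across devices
    return "tp"                 # generate/tp.py: shard each layer across devices
```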
A smart, automatic choice would be nice, but maybe this should be a feature flag. Maybe something like `--optimize smart (default) / memory / flops`, where

- "memory" uses `sequentially.py` (but what about applying quantization?)
- "flops" uses the current `base.py` implementation

I think `--devices` can be a separate argument to make it more similar to the finetuning scripts.
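For illustration, the flag could be wired up roughly like this. Plain `argparse` is used here just to sketch the mapping; the actual entrypoints may parse arguments differently, and the names are only an assumption:

```python
import argparse

# Sketch of the proposed flag: map --optimize to a generate implementation.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--optimize", choices=("smart", "memory", "flops"), default="smart",
    help="smart: pick automatically; memory: sequentially.py; flops: base.py",
)
parser.add_argument("--devices", type=int, default=1)
args = parser.parse_args()

if args.optimize == "memory":
    impl = "sequentially"
elif args.optimize == "flops":
    impl = "base"
else:
    # "smart": fall back to an automatic choice like the one sketched above
    impl = "auto"
```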
I don't see why you would want anything other than "flops" if the model fits on a single device. If it doesn't, you are forced to use one of the other techniques.
What if there is only a single device, but the model doesn't fit? Shouldn't the code switch to layer offloading? I think the DeepSpeed strategy from Fabric supports it.
The `sequentially.py` file could support it too if we want to.
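For context, here is a toy sketch of what single-device layer offloading looks like in principle. The class and model structure are invented for illustration; this is not how `sequentially.py` works today:

```python
import torch
import torch.nn as nn

# Toy layer offloading: keep all block weights on CPU and move each block to
# the GPU only for the duration of its forward pass.
class OffloadedStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList, device: torch.device):
        super().__init__()
        self.blocks = blocks  # all blocks stay on CPU between calls
        self.device = device

    @torch.inference_mode()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)   # host-to-device copy of the weights
            x = block(x)
            block.to("cpu")         # free GPU memory for the next block
        return x
```

Every block pays a full host-to-device weight copy per forward pass, which is where the latency goes.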
However, transformer inference at batch size 1 is already very latency bound, so this would make it even worse. It wouldn't be usable for anything serious.
I think @lantiga is interested in layer offloading. Ollama/llama.cpp use layer offloading, but they definitely do something more to achieve decent latency. The question is whether we can get something similar with DeepSpeed, or whether it would require significant changes.