Smart choice of the inference algorithm
`generate/base.py` and `generate/chat.py` (which uses the former) assume that the model fits in memory. There are also `generate/sequentially.py` and `generate/tp.py`, which support using multiple devices.

To streamline the experience, we could have the `chat` or `generate` entrypoints choose an implementation based on the model config (size, sequence length) and the available hardware (memory, number of devices). This would require #921.
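Roughly, the automatic choice could look something like the sketch below. The `choose_implementation` helper, the memory estimate, and the thresholds are all made up for illustration and are not existing litgpt code:

```python
import torch

# Hypothetical dispatch sketch: pick a generate implementation from the model
# config and the free memory on the current CUDA device. Assumes CUDA is
# available; config.n_layer / config.n_embd are the usual litgpt Config fields.
def choose_implementation(config, devices: int) -> str:
    # Rough parameter-count estimate for a decoder-only transformer
    # (~12 * n_layer * n_embd^2), times 2 bytes per weight for fp16/bf16.
    # A real check would also budget for the KV cache, which grows with
    # sequence length.
    approx_params = 12 * config.n_layer * config.n_embd ** 2
    weight_bytes = 2 * approx_params

    free_bytes, _ = torch.cuda.mem_get_info()
    if weight_bytes < free_bytes:
        return "base"           # generate/base.py: fastest when it fits on one device
    if devices > 1 and weight_bytes < devices * free_bytes:
        return "sequentially"   # generate/sequentially.py: spread layers across devices
    return "tp"                 # generate/tp.py: shard each layer across devices
```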
A smart, automatic choice would be nice, but maybe this should be a feature flag. Maybe something like `--optimize smart (default) / memory / flops`, where

- "memory" uses `sequentially.py` (but what about applying quantization?)
- "flops" uses the current `base.py` implementation

I think `--devices` can be a separate argument to make it more similar to the finetuning scripts.
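For illustration, the flag could be wired up roughly like this. Plain `argparse` is used here just to sketch the mapping; the actual entrypoints may parse arguments differently, and the names are only an assumption:

```python
import argparse

# Sketch of the proposed flag: map --optimize to a generate implementation.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--optimize", choices=("smart", "memory", "flops"), default="smart",
    help="smart: pick automatically; memory: sequentially.py; flops: base.py",
)
parser.add_argument("--devices", type=int, default=1)
args = parser.parse_args()

if args.optimize == "memory":
    impl = "sequentially"
elif args.optimize == "flops":
    impl = "base"
else:
    # "smart": fall back to an automatic choice like the one sketched above
    impl = "auto"
```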
I don't see why you would want anything other than "flops" if the model fits on a single device. If it doesn't, you are forced to use one of the other techniques.
What if there is only a single device, but the model doesn't fit? Shouldn't the code switch to layer offloading? I think the DeepSpeed strategy from Fabric supports it.
The `sequentially.py` file could support it too if we want to.
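For context, here is a toy sketch of what single-device layer offloading looks like in principle. The class and model structure are invented for illustration; this is not how `sequentially.py` works today:

```python
import torch
import torch.nn as nn

# Toy layer offloading: keep all block weights on CPU and move each block to
# the GPU only for the duration of its forward pass.
class OffloadedStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList, device: torch.device):
        super().__init__()
        self.blocks = blocks  # all blocks stay on CPU between calls
        self.device = device

    @torch.inference_mode()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)   # host-to-device copy of the weights
            x = block(x)
            block.to("cpu")         # free GPU memory for the next block
        return x
```

Every block pays a full host-to-device weight copy per forward pass, which is where the latency goes.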
However, transformer inference at batch size 1 is already very latency bound, so this would make it even worse. It wouldn't be usable for anything serious.
I think @lantiga is interested in layer offloading. Ollama/llama.cpp use layer offloading, but they definitely do something more to achieve decent latency. The question is whether we can get something similar with DeepSpeed, or whether it would require significant changes.