[Feature] support s-lora in turbomind backend
Motivation
In downstream tasks, LoRA is one of the most common ways to fine-tune an LLM. Inference speed degrades badly going from [turbomind backend + merged LoRA] to [pytorch backend + merged LoRA] to [pytorch backend + S-LoRA] (roughly 1x → 0.6x → 0.4x). Is there any chance of getting [turbomind backend + S-LoRA] to shorten this chain and boost the speed?
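For context, the [turbomind backend + merged LoRA] path above means baking each adapter into its own copy of the base weights before deployment. A minimal sketch of that step with peft, assuming hypothetical model and adapter paths:

```python
# Merge a LoRA adapter into the base weights so the result can be served
# by the turbomind backend. All paths below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights

model.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")
# Downside: one full merged copy per adapter, which is exactly what S-LoRA avoids.
```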
Related resources
No response
Additional context
No response
How many adapters do you need? Turbomind will only support lora without the "s-" in the future.
ok~, typically more than 2 adapters in deployment; S-LoRA can save GPU memory, I guess.
I agree! In deployment we sometimes need more than 2 adapters for different jobs, so it would be meaningful if turbomind supported S-LoRA.
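For reference, a rough sketch of the multi-adapter [pytorch backend + S-LoRA] path being discussed, assuming the `adapters` option of `PytorchEngineConfig` and the `adapter_name` field of `GenerationConfig` behave as in recent lmdeploy releases; adapter names and paths are placeholders:

```python
# Serve one base model with several LoRA adapters via the pytorch backend.
# Names/paths are placeholders; exact fields may differ between lmdeploy versions.
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    adapters=dict(  # each entry is one adapter sharing the same base weights
        task_a="path/to/lora-task-a",
        task_b="path/to/lora-task-b",
    ))
pipe = pipeline("path/to/base-model", backend_config=backend_config)

# Pick the adapter per request instead of keeping a merged model copy per task.
print(pipe(["Hello"], gen_config=GenerationConfig(adapter_name="task_a")))
print(pipe(["Hello"], gen_config=GenerationConfig(adapter_name="task_b")))
```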
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Hi~ Have you found any way to support multi-LoRA?