[Feature] support s-lora in turbomind backend
Motivation
In downstream tasks, LoRA is one of the most common ways to fine-tune an LLM. Inference speed degrades badly going from [turbomind backend + merged LoRA] to [pytorch backend + merged LoRA] to [pytorch backend + S-LoRA] (roughly 1x → 0.6x → 0.4x). Is there any chance of getting [turbomind backend + S-LoRA] to shorten this chain and boost the speed?
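For context, the [turbomind backend + merged LoRA] path above means baking each adapter into its own copy of the base weights before deployment. A minimal sketch of that step with peft, assuming hypothetical model and adapter paths:

```python
# Merge a LoRA adapter into the base weights so the result can be served
# by the turbomind backend. All paths below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights

model.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")
# Downside: one full merged copy per adapter, which is exactly what S-LoRA avoids.
```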
Related resources
No response
Additional context
No response
How many adapters do you need? Turbomind will only support lora without the "s-" in the future.
ok~, typically more than 2 adapters in deployment; S-LoRA can save GPU memory, I guess.
I agree! In deployment we sometimes need more than 2 adapters for different jobs, so it would be meaningful if turbomind supported S-LoRA.
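For reference, a rough sketch of the multi-adapter [pytorch backend + S-LoRA] path being discussed, assuming the `adapters` option of `PytorchEngineConfig` and the `adapter_name` field of `GenerationConfig` behave as in recent lmdeploy releases; adapter names and paths are placeholders:

```python
# Serve one base model with several LoRA adapters via the pytorch backend.
# Names/paths are placeholders; exact fields may differ between lmdeploy versions.
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    adapters=dict(  # each entry is one adapter sharing the same base weights
        task_a="path/to/lora-task-a",
        task_b="path/to/lora-task-b",
    ))
pipe = pipeline("path/to/base-model", backend_config=backend_config)

# Pick the adapter per request instead of keeping a merged model copy per task.
print(pipe(["Hello"], gen_config=GenerationConfig(adapter_name="task_a")))
print(pipe(["Hello"], gen_config=GenerationConfig(adapter_name="task_b")))
```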
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Hi~ Have you found any way to support multi-LoRA?