flux-fp8-api

Hot LoRA Replacement

Open · Lantianyou opened this issue Sep 21 '24 · 17 comments

Currently, it seems the LoRA is loaded before the API server comes up. Is there a way to load a LoRA on request and, once that request finishes, unload it?
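Sketched as a hypothetical request handler, this is the flow I mean (none of these names exist in the repo; load_lora/unload_lora are placeholders):

```python
# Hypothetical per-request flow. pipe.load_lora / pipe.unload_lora are
# placeholder names, not functions that exist in flux-fp8-api today.
def handle_generate(pipe, request):
    if request.lora_path:
        pipe.load_lora(request.lora_path, scale=request.lora_scale)
    try:
        return pipe.generate(prompt=request.prompt)
    finally:
        # Unload even if generation raises, so the next request starts clean.
        if request.lora_path:
            pipe.unload_lora(request.lora_path)
```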

Lantianyou avatar Sep 21 '24 10:09 Lantianyou

Thanks for your work @aredden

Lantianyou avatar Sep 21 '24 10:09 Lantianyou

I think this would be awesome, and I could work on it. The main issue is that I would need to figure out whether merging a LoRA and then unmerging it would affect the original weights. I will look into it, since that would be a nice option to have.

aredden avatar Sep 21 '24 16:09 aredden

Thank you for your reply. I will also try to implement it, although I am not an expert in this area.

Lantianyou avatar Sep 21 '24 18:09 Lantianyou

I did some Googling. PEFT claims it can merge and unmerge LoRAs, but no details are explained (links below; a sketch of the PEFT calls follows them):

https://discuss.huggingface.co/t/can-i-dynamically-add-or-remove-lora-weights-in-the-transformer-library-like-diffusers/87890

https://stackoverflow.com/questions/78518971/can-i-dynamically-add-or-remove-lora-weights-in-the-transformer-library-like-dif
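For reference, PEFT exposes this round trip on PeftModel via merge_adapter() / unmerge_adapter(). A minimal sketch, assuming a transformers base model and a LoRA checkpoint on disk (the model name and LoRA path are placeholders):

```python
# Minimal PEFT merge/unmerge round trip. "gpt2" and "path/to/lora" are
# placeholders; merge_adapter/unmerge_adapter are real PEFT methods.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base, "path/to/lora")

model.merge_adapter()    # fuse the LoRA deltas into the base weights
# ... run inference with the fused weights ...
model.unmerge_adapter()  # subtract the same deltas back out
```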

Lantianyou avatar Sep 21 '24 18:09 Lantianyou

To unload the LoRA, I tried loading the same LoRA with scale=-1, but I ran out of CUDA memory on a 24 GB 4090.
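In full precision the scale=-1 trick is mathematically sound, since the second application just subtracts the same delta; the OOM is more likely from materializing a second copy of the deltas than from the math. A toy check in plain PyTorch:

```python
import torch

W = torch.randn(64, 64)   # original weight
A = torch.randn(8, 64)    # LoRA down-projection
B = torch.randn(64, 8)    # LoRA up-projection
scale = 0.8

W_merged = W + scale * (B @ A)              # merge
W_restored = W_merged + (-scale) * (B @ A)  # "load the same LoRA with scale=-1"
print(torch.allclose(W, W_restored, atol=1e-5))  # True in fp32
```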

Lantianyou avatar Sep 21 '24 18:09 Lantianyou

I guess that to unmerge the LoRA cleanly, you have to save the original weights somewhere first, but that would introduce performance overhead.
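One hedged sketch of that approach: stash a CPU copy of each weight you patch before fusing, then copy it back on unload. This trades host RAM and a device-to-host transfer for an exact restore (the function and dict names here are illustrative, not flux-fp8-api's):

```python
import torch

_originals: dict[str, torch.Tensor] = {}

def fuse_lora(name: str, linear: torch.nn.Linear, A: torch.Tensor,
              B: torch.Tensor, scale: float) -> None:
    # Back up the pristine weight on CPU so unfusing is lossless.
    _originals[name] = linear.weight.detach().to("cpu", copy=True)
    delta = (B @ A).to(linear.weight.dtype) * scale
    linear.weight.data += delta.to(linear.weight.device)

def unfuse_lora(name: str, linear: torch.nn.Linear) -> None:
    # Restore the exact original weight instead of subtracting the delta.
    linear.weight.data.copy_(_originals.pop(name).to(linear.weight.device))
```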

Lantianyou avatar Sep 21 '24 18:09 Lantianyou

Yeah, that's the problem: you wouldn't want to keep the LoRA weights in memory, so you'd fuse them into the model weights, but fusing and unfusing many times could end up degrading the original weights.
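The drift is easy to see in reduced precision, where each fuse/unfuse round trip leaves rounding residue; with fp8-quantized base weights it would be worse. A toy fp16 demonstration:

```python
import torch

W0 = torch.randn(1024, 1024, dtype=torch.float16)
A = torch.randn(16, 1024, dtype=torch.float16) * 0.1
B = torch.randn(1024, 16, dtype=torch.float16) * 0.1
delta = B @ A

W = W0.clone()
for _ in range(100):
    W += delta  # fuse
    W -= delta  # unfuse: rounds differently, so W drifts away from W0
print((W - W0).abs().max())  # nonzero, and grows with more round trips
```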

aredden avatar Sep 22 '24 02:09 aredden

True and true

Lantianyou avatar Sep 22 '24 02:09 Lantianyou

So I implemented it, but it's not ready for a push; it seems to work well though! It includes loading and unloading, and I added a web endpoint for it.

aredden avatar Sep 22 '24 03:09 aredden

Would you mind pushing the code to a different branch, so I can test it?

Lantianyou avatar Sep 22 '24 03:09 Lantianyou

Alright, I pushed it to 'removable-lora' (https://github.com/aredden/flux-fp8-api/tree/removable-lora). You can test it if you want, though it's currently not in the web API, so you'd have to test it via a script @Lantianyou
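A rough test script might look like the following. The FluxPipeline entry point follows the repo's README; pipe.load_lora / pipe.unload_lora are guesses at the branch's API, so check the branch for the real names and signatures:

```python
# Rough test script. FluxPipeline.load_pipeline_from_config_path follows
# the README; load_lora/unload_lora and their signatures are assumptions
# about the removable-lora branch, not confirmed API.
from flux_pipeline import FluxPipeline

pipe = FluxPipeline.load_pipeline_from_config_path("configs/config-dev.json")

before = pipe.generate(prompt="a photo of a cat", seed=0)

pipe.load_lora("my_lora.safetensors", scale=1.0)  # assumed call
with_lora = pipe.generate(prompt="a photo of a cat", seed=0)

pipe.unload_lora("my_lora.safetensors")           # assumed call
after = pipe.generate(prompt="a photo of a cat", seed=0)
# `before` and `after` should match if the unload is lossless.
```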

aredden avatar Sep 22 '24 16:09 aredden

Thanks a lot, I'll get back to you with the results.

Lantianyou avatar Sep 22 '24 16:09 Lantianyou

> Alright, I pushed it to 'removable-lora' (https://github.com/aredden/flux-fp8-api/tree/removable-lora). You can test it if you want, though it's currently not in the web API, so you'd have to test it via a script @Lantianyou

I tested this branch and found that unloading the LoRA on a single 4090 triggers an OOM.

81549361 avatar Sep 26 '24 08:09 81549361

@aredden I can successfully unload a LoRA immediately after loading it, but if I unload it after running an inference, an OOM occurs.

81549361 avatar Sep 28 '24 07:09 81549361

Ah, I guess it might need some work on cleaning up the LoRAs after loading/unloading. I will work on this, thanks @81549361
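For anyone hitting this in the meantime, the usual suspects for a post-inference OOM like this are lingering Python references to the delta/backup tensors plus PyTorch's caching allocator. A hedged cleanup sketch (unload_lora is the hypothetical unload entry point):

```python
import gc
import torch

def unload_lora_and_cleanup(pipe, lora_path: str) -> None:
    pipe.unload_lora(lora_path)  # hypothetical unload call
    gc.collect()                 # drop lingering references to LoRA tensors
    torch.cuda.empty_cache()     # hand cached blocks back to the CUDA driver
```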

aredden avatar Sep 30 '24 16:09 aredden

> Ah, I guess it might need some work on cleaning up the LoRAs after loading/unloading. I will work on this, thanks @81549361

Thank you very much, your repo is awesome!

81549361 avatar Sep 30 '24 16:09 81549361

Alright, I merged it into the main branch.

aredden avatar Oct 03 '24 23:10 aredden