
[Feature Request]: Add LoRA

Open alexfredo opened this issue 1 year ago • 7 comments

Please add LoRA support to ModelScope!

alexfredo avatar Mar 24 '23 13:03 alexfredo

We'll try, but it is going to be an extremely challenging process, since it will require changing 2D LoRA layers to 3D ones, and there can even be complications due to the model's temporal components.


Any help on the issue is appreciated!

kabachuha avatar Mar 24 '23 14:03 kabachuha

Thank you for all your hard work, I really love ModelScope :)

alexfredo avatar Mar 24 '23 14:03 alexfredo

https://github.com/huggingface/diffusers/issues/2789 — the work on LoRAs should start soon, hopefully

kabachuha avatar Mar 24 '23 15:03 kabachuha

Hi! I created LoRA.

> it is going to be an extremely challenging process since it will require changing 2D lora layers to 3D ones

Happy to see how I can help if you could elaborate on the challenge.

edwardjhu avatar Mar 26 '23 13:03 edwardjhu

Hi! Pleased to see you here 🙂

Tbh, I haven't tried implementing it here yet, so I don't know the inner workings of LoRA very well, except that it injects adapters into Linear, Conv2d and MultiheadAttention layers. If I understand correctly, for it to work here, Conv3d and the TemporalTransformer would have to be modified too.

The main question I have is how it would behave with the TemporalTransformer, and where the LoRA layers should be added to it here: https://github.com/deforum-art/sd-webui-modelscope-text2video/blob/e9e6eb04fdf0d1557674eed439a024c790449374/scripts/t2v_model.py#L556. Conv3d is a standard PyTorch module, while the TemporalTransformer is a custom class introduced by this model.

Another question: would it be possible to train LoRAs only on images and then insert the concepts into the network? For example, if I want the network to animate a character from a few image artworks I have.

kabachuha avatar Mar 26 '23 14:03 kabachuha

Here's how I think about LoRA. Whenever there's a weight tensor, e.g., in nn.Linear or nn.Conv2d, that we first pretrain (to obtain W_0) and then finetune (to obtain W_1), we can instead freeze the pretrained tensor W_0 and reparametrize the difference \Delta W = W_1 - W_0 using lower-rank tensors, i.e., \Delta W = U @ D, where D is a down projection and U is an up projection.
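A minimal sketch of that reparametrization for a linear layer (PyTorch; the class name, rank, and scaling are illustrative, not taken from any particular LoRA implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pretrained nn.Linear (W_0) and learn a low-rank update
    Delta W = U @ D, where D is a down projection and U is an up projection."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W_0 stays frozen
        # D: (rank, in_features), U: (out_features, rank)
        self.D = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        # U is zero-initialized so Delta W = 0 at the start of training
        self.U = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W_0 @ x plus the low-rank correction, computed as two small matmuls
        return self.base(x) + (x @ self.D.T @ self.U.T) * self.scale
```

Only D and U are trained, so the number of trainable parameters per layer drops from in_features * out_features to rank * (in_features + out_features).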

We can modify Conv3d and the other layers used in the Transformer. In practice, modifying a subset of the layers suffices; e.g., we adapted only the q and v projections in self-attention for GPT-3, and that was good enough.

> Another question: would it be possible to make LoRAs only from images and then insert the concepts into the network? Like if I want the network to animate a character from a few image-arts I have

If something can be done with finetuning, it can probably also be done with LoRA much more cheaply and maybe more sample-efficiently.

Hope this helps!

edwardjhu avatar Apr 03 '23 15:04 edwardjhu

Can we work with embeddings in the meantime?

Natotela avatar May 01 '23 19:05 Natotela