sd-webui-text2video
[Feature Request]: Add LoRA
Please add LoRA to ModelScope!
We'll try, but it is going to be an extremely challenging process, since it will require changing the 2D LoRA layers to 3D ones, and there may be further complications because the model is temporal.
Any help on the issue is appreciated!
Thank you for all your hard work, I really love ModelScope :)
https://github.com/huggingface/diffusers/issues/2789 — the work on LoRAs should start soon, hopefully
Hi! I created LoRA.
it is going to be an extremely challenging process, since it will require changing the 2D LoRA layers to 3D ones
Happy to see how I can help if you could elaborate on the challenge.
Hi! Pleased to see you here 🙂
Tbh, I haven't tried implementing it here yet, so I don't know the inner workings of LoRA very well, except that it injects stuff into Linear, Conv2d, and MultiheadAttention layers. If I get it correctly, for it to work here, Conv3d and TemporalTransformer would have to be modified too.
The main question I have is how it would behave with the TemporalTransformer and where the LoRA layers should be added to it here https://github.com/deforum-art/sd-webui-modelscope-text2video/blob/e9e6eb04fdf0d1557674eed439a024c790449374/scripts/t2v_model.py#L556, since Conv3d is a standard function, while the TemporalTransformer is a custom class introduced here.
Another question: would it be possible to make LoRAs from images only and then insert the concepts into the network? For example, if I want the network to animate a character from a few artworks I have.
Here's how I think about LoRA. Whenever there's a weight tensor, e.g., nn.Linear or nn.Conv2d, that we first pretrain (to obtain W_0) and then finetune (to obtain W_1), we can freeze the pretrained tensor (W_0) and reparametrize the difference (\Delta W = W_1 - W_0) using lower-rank tensors, i.e., \Delta W = U @ D, where D is a down projection and U is an up projection.
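To make that concrete, here is a minimal PyTorch sketch of that reparametrization (the LoRALinear name and the rank/scaling hyperparameters are illustrative, not from any existing implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the pretrained W_0 and learn a low-rank update Delta W = U @ D."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # W_0 stays frozen
        # D: down projection (in_features -> r), U: up projection (r -> out_features)
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.down.weight, std=1.0 / r)
        nn.init.zeros_(self.up.weight)            # so Delta W = 0 at initialization
        self.scale = alpha / r

    def forward(self, x):
        # W_1 x = W_0 x + (U @ D) x, without ever materializing Delta W
        return self.base(x) + self.up(self.down(x)) * self.scale
```

Only `down` and `up` receive gradients, which is where the cheap finetuning comes from.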
We can modify Conv3d and the other layers used in the Transformer. In practice, modifying a subset of the layers suffices; e.g., we only did q and v in self-attention for GPT-3, and it was good enough.
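The same trick extends to a 3D conv, and the "subset of layers" point can be done by patching only the q/v projections. A sketch continuing the code above; note that `to_q`/`to_v` are hypothetical attribute names, and the real ones depend on how t2v_model.py names its attention projections:

```python
class LoRAConv3d(nn.Module):
    """Same reparametrization for a 3D conv: two stacked low-rank convs."""
    def __init__(self, base: nn.Conv3d, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze W_0
        # The down conv carries the spatio-temporal kernel; the up conv is a 1x1x1 mix
        self.down = nn.Conv3d(base.in_channels, r, base.kernel_size,
                              stride=base.stride, padding=base.padding,
                              dilation=base.dilation, bias=False)
        self.up = nn.Conv3d(r, base.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)            # Delta W = 0 at initialization
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale


def add_lora_to_qv(model: nn.Module, r: int = 4):
    """Patch only the q and v projections, as in the GPT-3 experiments.
    to_q/to_v are hypothetical names used for this sketch."""
    for module in list(model.modules()):
        if isinstance(getattr(module, "to_q", None), nn.Linear):
            module.to_q = LoRALinear(module.to_q, r=r)
        if isinstance(getattr(module, "to_v", None), nn.Linear):
            module.to_v = LoRALinear(module.to_v, r=r)
```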
Another question: would it be possible to make LoRAs from images only and then insert the concepts into the network? For example, if I want the network to animate a character from a few artworks I have.
If something can be done with finetuning, it can probably also be done with LoRA much more cheaply and maybe more sample-efficiently.
Hope this helps!
Can we work with embeddings in the meantime?