
support fused_back_pass for prodigy-plus-schedule-free

Open · michP247 opened this issue 1 year ago · 9 comments

Copied the internals from https://github.com/LoganBooker/prodigy-plus-schedule-free into kohya's library/prodigy_plus_schedulefree.py and made the training scripts support either ProdigyPlus or fused Adafactor when FBP is set.

From my short tests of DreamBooth training with the args --fused_backward_pass --optimizer_type="prodigyplus.ProdigyPlusScheduleFree":

SD3.5 Medium, 512x512, w/ --full_bf16:
- base Prodigy: 27.2 GB VRAM
- prodigy-plus-schedule-free: 15.4 GB
- prodigy-plus-schedule-free w/ FBP: 10.2 GB

SDXL, 1024x1024, w/ --full_bf16:
- base Prodigy: 33 GB
- prodigy-plus-schedule-free: 19 GB
- prodigy-plus-schedule-free w/ FBP: 13 GB

Didn't test Flux, but the gains should be similar.

michP247 · Jan 06 '25

wow nice

@michP247 you find this better than other optimizers?

FurkanGozukara · Jan 06 '25

Will check results later; I haven't actually completed any training in my tests, just did a quick VRAM check last night lol (edited to mention I was using full bf16). Still need to figure out the correct --prodigy_steps value.

michP247 · Jan 06 '25

Thanks for this pull request!

But I think it may work with the --optimizer_type and --optimizer_args options, like --optimizer_type "prodigyplus.ProdigyPlusScheduleFree" --optimizer_args "fused_back_pass=True" without any additional implementation. Have you tried this?
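
For reference, here is a rough Python sketch of what that combination resolves to once the optimizer is built. The import path is only inferred from the optimizer_type string, lr=1.0 is just the usual Prodigy-style default, and the toy torch.nn.Linear stands in for the real network; kohya's optimizer loader is what actually does this wiring.

    import torch
    from prodigyplus import ProdigyPlusScheduleFree  # from LoganBooker/prodigy-plus-schedule-free

    model = torch.nn.Linear(8, 8)  # toy module standing in for the network being trained

    optimizer = ProdigyPlusScheduleFree(
        model.parameters(),
        lr=1.0,                # Prodigy-style optimizers are typically run at lr=1.0
        fused_back_pass=True,  # same effect as --optimizer_args "fused_back_pass=True"
    )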

kohya-ss · Jan 06 '25

Hm, OK, so I just tried it and it does already work as an optimizer arg, which I overlooked. But at least now it won't break anything if --fused_backward_pass is passed regularly.

michP247 · Jan 06 '25

This will be good for the bmaltais GUI, since we'd simply be able to use the FBP checkbox with this optimizer.

michP247 · Jan 06 '25

Update: I apologize for the late reply. The issue has been fixed in ProdigyPlusScheduleFree v1.8.3. Thanks to @LoganBooker for his work. I tested v1.8.4 and it works fine now; the modifications from my commit are no longer needed.


Previous comment follows (the issue was about register_post_accumulate_grad_hook and groups_to_process).

I attempted to add the fused backward pass to train_network.py; my changes: https://github.com/kohya-ss/sd-scripts/compare/sd3...Exist-c:sd-scripts:sd3

Based on the implementation in sdxl_train.py and my tests in train_network.py, I think the optimizer's step_param should be registered on the parameters, similar to Adafactor; otherwise the optimizer will do nothing. I'm not certain whether Flux or SD3.5 require this, but I thought it would be helpful to mention it. Here is my implementation in train_network.py:

    # The accelerator has wrapped the optimizer, so we need optimizer.optimizer
    # to reach the original step_param method.
    for param_group in optimizer.optimizer.param_groups:
        for parameter in param_group["params"]:
            if parameter.requires_grad:

                # Bind param_group as a default argument so each hook keeps a
                # reference to its own group.
                def __grad_hook(tensor: torch.Tensor, param_group=param_group):
                    if accelerator.sync_gradients and args.max_grad_norm != 0.0:
                        accelerator.clip_grad_norm_(tensor, args.max_grad_norm)
                    optimizer.optimizer.step_param(tensor, param_group)
                    tensor.grad = None  # clear the grad to save memory

                parameter.register_post_accumulate_grad_hook(__grad_hook)

And in my implementation, if both the text_encoder and the unet are training, step_param would be called prematurely for parameters of the next step, leading to errors. I made some modifications in on_end_step(), but I think they change the optimizer's behavior, so it is not the correct solution.

    def patch_on_end_step(optimizer, group):
        group_index = optimizer.optimizer.param_groups.index(group)

        # My patch (I think it's wrong): skip groups that have already been
        # removed from groups_to_process.
        if group_index not in optimizer.optimizer.groups_to_process:
            return False

        # Decrement params processed so far.
        optimizer.optimizer.groups_to_process[group_index] -= 1
        ...

I'm not good at English, and the above translations are all done by machine translation. I hope I haven't offended anyone.

Exist-c · Jan 06 '25

Update: As of this commit for Prodigy+SF, all that should be needed in this pull request is to alter the assert; it will then be sufficient to set args.fused_backward_pass=True to activate FBP -- the optimiser will take care of the rest. Note that like Adafactor, Kohya only supports FBP for full finetuning (as far as I'm aware).

Previous comment follows.


Hello all, and thanks for your interest in the optimiser. I made a best-effort attempt to match how Kohya had implemented fused backward pass for Adafactor, in the hope it would be fairly straightforward to add support. Seems it's a bit more involved!

I've had a closer look at the SD3 branch and decided it would be easier perhaps to monkey patch the Adafactor patching method. This has been done in my most recent commit (https://github.com/LoganBooker/prodigy-plus-schedule-free/commit/93339d859eb7b1119a004edecf417f5318227af8). Note I haven't created a new package for this just yet, so you'll need to use the code/repo directly to get this change.

What this means is that for this pull request, all you should need to do is tweak the hard-coded assert in train_util.py to allow Prodigy+SF as well. That's it (apart from installing/importing/selecting the optimiser itself). https://github.com/kohya-ss/sd-scripts/blob/e89653975ddf429cdf0c0fd268da0a5a3e8dba1f/library/train_util.py#L4633-L4636
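
For illustration only, a standalone sketch of the kind of relaxed check that would be involved; the helper name, the accepted strings, and the message below are assumptions, and the real code is the hard-coded assert at the train_util.py lines linked above.

    def check_fused_backward_pass(optimizer_type: str, fused_backward_pass: bool) -> None:
        # Hypothetical stand-in for the hard-coded assert: with FBP enabled,
        # accept ProdigyPlusScheduleFree as well as Adafactor.
        if fused_backward_pass:
            supported = ("adafactor", "prodigyplus.prodigyplusschedulefree")
            assert optimizer_type.lower() in supported, (
                "fused_backward_pass currently only works with Adafactor or ProdigyPlusScheduleFree"
            )

    # This combination should now pass without raising.
    check_fused_backward_pass("prodigyplus.ProdigyPlusScheduleFree", fused_backward_pass=True)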

Once that's done, you should be able to use the fused backward pass by passing fused_backward_pass=True to the optimiser, and setting args.fused_backward_pass=True to Kohya. Alternatively, you could retain the change that appends it to the optimiser arguments.

LoganBooker · Jan 08 '25

Someone should take a look at this and the suggestion from Exist-c. I won't be able to update this PR for a while, as I'm dealing with some PC troubles.

michP247 · Jan 08 '25

I did trainings with prodigyplus.ProdigyPlusScheduleFree yesterday, but it didn't learn anything.

I must be missing some optimizer arguments; can you give me a solid example? Thank you.

I trained one model up to 5600 steps and another to 2800 steps.

FurkanGozukara · Feb 25 '25