fix: model caching now requires deepcopy
Not sure why this change is required.
Fixes https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4305
I made a pull request #4514 to fix the checkpoint caching, which will conflict with this one. I think the caching is done wrong, as it adds the weights to the cache before they are loaded from the file.
I think one problem is that it currently uses model.sd_checkpoint_info as the key, but that may still be the old info; the new info is passed into the method as a parameter. Also, model.state_dict() at that point is not what we want to cache, since the new weights haven't been loaded yet.
```python
checkpoints_loaded[model.sd_checkpoint_info] = model.state_dict().copy()
```
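What I have in mind is roughly the ordering below. This is only a sketch of the idea, not the actual code from either PR: `read_state_dict` is a placeholder for however the weights are actually read from the file, and whether the cached copy is a shallow `.copy()` or a `deepcopy` is exactly what the rest of this thread is about.

```python
from copy import deepcopy

checkpoints_loaded = {}  # module-level cache, keyed by checkpoint info

def load_model_weights(model, checkpoint_info):
    if checkpoint_info not in checkpoints_loaded:
        # load the new weights from file first...
        sd = read_state_dict(checkpoint_info.filename)  # placeholder helper
        model.load_state_dict(sd, strict=False)
        # ...and only then cache them, under the *new* checkpoint's key
        checkpoints_loaded[checkpoint_info] = deepcopy(model.state_dict())
    else:
        # cache hit: restore the cached weights without re-reading the file
        model.load_state_dict(checkpoints_loaded[checkpoint_info], strict=False)
    model.sd_checkpoint_info = checkpoint_info
```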
Edit: as commented on my PR, both versions seem to work fine. I am also absolutely fine with merging this PR, and I will close mine.
It's true. Model caching now requires deepcopy somehow. The tests I did in #4214 passed just fine while it still used dict.copy(), but the same test didn't pass in #4666 until I changed it to use deepcopy. I did the tests for #4214 13 days ago, on Nov 4... the same day as this PR.
According to the revision history, I revised #4214 to show that test on Nov 4 at 12:21 PM. My last merge from upstream prior to that was on Nov 4 at 11:04 AM: 95457b35dd03d67eada082ddf65ce4a6b87f29df. The most recent upstream commit in that merge is 59a21a67d220d6943be2fc3c5632c02c3ffc99d1.
Your PR was made on Nov 4 at 10:55 PM; it's the oldest one I know of about this issue. The latest upstream commit in this PR, prior to your fix, is d61f0ded24b2b0c69b1b85293d6f71e4de0d1c63.
So whatever happened should be somewhere between 59a21a67d220d6943be2fc3c5632c02c3ffc99d1 and d61f0ded24b2b0c69b1b85293d6f71e4de0d1c63.
@R-N When you made this PR https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/17f9e55667ba4144232543bf12b94e75ebd4bd6f did you check if it was working?
Did the alternative PR for this issue not fix this? (https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4514)
I'm pretty sure model caching is working now, but I can check when I get a chance.
@R-N When you made this PR 17f9e55 did you check if it was working?
I did. I didn't document it though, so I wasn't sure if I did it well.
Did the alternative PR for this issue not fix this? (#4514)
Yes, it solved the checkpoint loading. But I just found out that the same thing is now happening with the VAE, while it didn't happen before. So model caching used to not require deepcopy at all, even when the copy was made right before loading new weights.
@R-N When you made this PR 17f9e55 did you check if it was working?
I did. I didn't document it though, so I wasn't sure if I did it well.
So that means we can zoom in further on the faulty PR? Since this PR of yours is in the middle of the range you stated previously.
So that means we can zoom in further on the faulty PR? Since this PR of yours is in the middle of the range you stated previously.
Well, yes, we can narrow it down, but
- I'm not sure if it's "faulty". The change might be intentional, for something better.
- Since this already has a workaround, it might not be worth spending time to track it down. There are still a lot of commits in the range I mentioned (66, according to git rev-list --count).
I'm just mentioning this here to confirm the change: it used to be possible to cache with just state_dict().copy(), but now deepcopy is needed, or whatever else it takes between checkpoint loads. The commit range is just additional info in case of future issues.
When you made this PR 17f9e55 did you check if it was working?
I just realized that I can simply make a branch at my commit (bf7a699845675eefdabb9cfa40c55398976274ae) and test it. Yes, it was broken from the start. I guess I only tested whether it loaded the correct hash from the cache, and didn't compare the generated images.

So state dict copying has always required deepcopy. The PyTorch tutorial on saving and loading models says:
You must serialize best_model_state or use best_model_state = deepcopy(model.state_dict()) otherwise your best best_model_state will keep getting updated by the subsequent training iterations.
And I suppose checkpoint weight loading works the same way as weight updates during training.
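Here's a minimal, plain-PyTorch illustration of what I mean (nothing webui-specific): load_state_dict writes the new values into the existing parameter tensors in place, and a shallow dict.copy() still references those same tensors, so the "cached" weights get overwritten; a deepcopy clones the tensors and survives the load.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(2, 2, bias=False)

shallow = model.state_dict().copy()       # new dict, but the same tensor objects
deep = copy.deepcopy(model.state_dict())  # new dict with cloned tensors

# simulate loading a different checkpoint into the same model
model.load_state_dict({"weight": torch.zeros(2, 2)})

print(torch.equal(shallow["weight"], torch.zeros(2, 2)))  # True: the shallow "cache" was overwritten
print(torch.equal(deep["weight"], torch.zeros(2, 2)))     # False: the deepcopied cache kept the old weights
```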
Testing #4214 was fine because #4036 hadn't been merged yet; it was merged on Nov 4 at 2:54 PM, and I tested #4214 before 12:21 PM the same day. Also, there was practically no caching when I tested #4214, because I used caching=1 and it was bugged.
Sorry about all this.
@R-N I can just implement this on top of where the code is now.
@R-N new PR here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4991
@R-N new PR here: #4991
Wait, what. I was just apologizing about it...
Anyway, if you do deepcopy with the current caching flow, you'll end up with a deepcopy of the current model's weights too, which should take up more memory. The current implementation works fine without deepcopy: it copies the current model, but it's a shallow copy, so it should be fine... I think? I don't know where the magic happens that makes it work without deepcopy.
To sum it up, I think deepcopy isn't necessary for the current implementation and would instead just increase memory usage unnecessarily. This is only my opinion, though, so you might want to test the memory usage.
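If anyone wants a rough idea of the cost, a back-of-the-envelope sketch like this should do; the Linear layer is just a stand-in for the real checkpoint, which is a few GB:

```python
import copy
import torch
import torch.nn as nn

def state_dict_bytes(sd):
    # total size of the tensors referenced by a state dict
    return sum(t.numel() * t.element_size() for t in sd.values() if torch.is_tensor(t))

model = nn.Linear(1024, 1024)   # stand-in; a real SD checkpoint is a few GB
sd = model.state_dict()

shallow = sd.copy()             # new dict, same tensors: ~no extra weight memory
deep = copy.deepcopy(sd)        # cloned tensors: roughly one extra full copy of the weights

print(f"one copy of the weights:   {state_dict_bytes(sd) / 2**20:.2f} MiB")
print(f"extra held by deepcopy:    {state_dict_bytes(deep) / 2**20:.2f} MiB")
print("extra held by dict.copy(): ~0 MiB (tensors are shared with the model)")
```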
Fair enough. Sorry, I was confused and thought it wasn't working.