stable-diffusion-webui
CLIP crop (BLIP Crop, YOLO Crop?)
So, this started out as a mishmash of another library called CLIPCrop, and then quickly devolved into my own hellapalooza of borrowed code.
Basically, what it does is first interrogate the image using BLIP or whatever we have set up, but sort of dumbed down to get a basic description of the image.
Then, it passes it through the YOLOv5 object detection model, which breaks the image up into various detected components.
Finally, we feed it the prompt from CLIP, determine the most likely "main" subject of the photo, get a bounding box for that subject, and then expand it to match the dimensions we're scaling to. We then crop the image to those dimensions, and then downscale as necessary to reach the final result.
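Roughly, the flow boils down to something like the sketch below. This is not the code in this PR: it stands in a stock yolov5s checkpoint via torch.hub and the openai `clip` package for the scoring, and the function name and box-expansion math are just illustrative.

```python
import torch
import clip  # the openai CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s")            # object detection
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)


def smart_crop(image: Image.Image, prompt: str, width: int, height: int) -> Image.Image:
    """Crop `image` around the detected subject that best matches `prompt`.

    `prompt` is either the caption from the interrogator or a user-supplied string
    (the "target a specific image element" option mentioned further down).
    """
    # 1. Break the image up into detected components with YOLOv5.
    detections = yolo(image).xyxy[0]                  # rows of [x1, y1, x2, y2, conf, cls]
    if len(detections) == 0:
        return image.resize((width, height))          # nothing detected: plain resize

    # 2. Score each detected crop against the prompt with CLIP to pick the "main" subject.
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        best_box, best_score = None, float("-inf")
        for x1, y1, x2, y2, conf, cls in detections.tolist():
            crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
            img_feat = clip_model.encode_image(clip_preprocess(crop).unsqueeze(0).to(device))
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            score = (img_feat @ text_feat.T).item()
            if score > best_score:
                best_box, best_score = (x1, y1, x2, y2), score

    # 3. Expand the winning box to the target aspect ratio, roughly clamped to the image.
    x1, y1, x2, y2 = best_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    bw, bh = x2 - x1, y2 - y1
    target_ratio = width / height
    if bw / bh < target_ratio:
        bw = bh * target_ratio                        # box too narrow: widen
    else:
        bh = bw / target_ratio                        # box too wide: heighten
    left = max(0, min(cx - bw / 2, image.width - bw))
    top = max(0, min(cy - bh / 2, image.height - bh))
    cropped = image.crop((int(left), int(top), int(left + bw), int(top + bh)))

    # 4. Downscale to the final training resolution.
    return cropped.resize((width, height), Image.LANCZOS)
```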
Compared to "no crop influence", "auto focal point crop", and my previous attempt at a method, I think the results speak for themselves. Will this work perfectly for every use-case? No. Is it a lot smarter than before? I think so.
OH, also, you don't have to interrogate the image with CLIP. I would have to add it to the UI, but you can also pass a specific prompt to the pre-processor to target a specific image element.
This is a replacement for #3670
You're right, insane.
I'm surprised you're not exposing the prompt parameter to allow textually guided clipping.
All in good time. Fixed some issues where it couldn't detect subjects and target sizes were less than ideal.
Need to copy how the "focal crop" method hides/shows elements in the UI; then I can add the prompt as an option.
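For reference, that show/hide pattern boils down to something like this minimal Gradio sketch; the component names and labels are made up for illustration, not what the preprocess tab actually uses:

```python
import gradio as gr

with gr.Blocks() as demo:
    use_clip_crop = gr.Checkbox(label="Use CLIP crop", value=False)
    crop_prompt = gr.Textbox(label="Crop prompt (optional)", visible=False)

    # Toggle the textbox's visibility whenever the checkbox changes.
    use_clip_crop.change(
        fn=lambda enabled: gr.update(visible=enabled),
        inputs=[use_clip_crop],
        outputs=[crop_prompt],
    )

if __name__ == "__main__":
    demo.launch()
```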
@AUTOMATIC1111 - I think this one should be ready to go if you wanna give it a once-over.
This is great!
With smart crop and captioning selected, I get:
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 49, in preprocess_work
clipseg = CropClip()
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\clipcrop.py", line 64, in __init__
self.model = torch.hub.load('ultralytics/yolov5', 'custom', model_path[0])
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\hub.py", line 540, in load
model = _load_local(repo_or_dir, model, *args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\hub.py", line 569, in _load_local
model = entry(*args, **kwargs)
File "C:\Users\esfle/.cache\torch\hub\ultralytics_yolov5_master\hubconf.py", line 83, in custom
return _create(path, autoshape=autoshape, verbose=_verbose, device=device)
File "C:\Users\esfle/.cache\torch\hub\ultralytics_yolov5_master\hubconf.py", line 33, in _create
from models.common import AutoShape, DetectMultiBackend
ModuleNotFoundError: No module named 'models.common'
After checking it out again it now works, but if launched with captioning it gives this error:
Fusing layers...
YOLOv5m6 summary: 378 layers, 35704908 parameters, 0 gradients
Adding AutoShape...
0%| | 0/24 [00:05<?, ?it/s]
Error completing request
Arguments: ('C:\\Users\\esfle\\OneDrive\\תמונות\\testShowcase', 'C:\\Users\\esfle\\OneDrive\\תמונות\\testShowcase\\outputtest', 512, 512, 'ignore', False, False, True, False, 0.5, 0.2, False, 0.9, 0.15, 0.5, False, True) {}
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 218, in preprocess_work
save_pic(img, index, existing_caption=existing_caption)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 100, in save_pic
save_pic_with_caption(image, index, existing_caption=existing_caption)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption
caption += shared.interrogator.generate_caption(image)
File "F:\GitHub\stable-diffusion-webui-auto\modules\interrogate.py", line 128, in generate_caption
caption = self.blip_model.generate(gpu_image, sample=False, num_beams=shared.opts.interrogate_clip_num_beams, min_length=shared.opts.interrogate_clip_min_length, max_length=shared.opts.interrogate_clip_max_length)
File "F:\GitHub\stable-diffusion-webui-auto\repositories\BLIP\models\blip.py", line 129, in generate
image_embeds = self.visual_encoder(image)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\repositories\BLIP\models\vit.py", line 182, in forward
x = self.patch_embed(x)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\timm\models\layers\patch_embed.py", line 35, in forward
x = self.proj(x)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault.
If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault.
If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
The better question would be - does this fail when NOT using the ClipCrop branch?
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault. If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
The better question would be - does this fail when NOT using the ClipCrop branch?
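For what it's worth, here is a minimal standalone illustration of that mismatch, using a toy conv layer as a stand-in for BLIP's patch embedding (nothing here is the actual webui code):

```python
import torch

# Toy stand-in for BLIP's patch-embedding conv: weights stay on the CPU in half
# precision while the input has already been moved to the GPU, reproducing
# "Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same".
model = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16).half()  # CPU weights
x = torch.randn(1, 3, 224, 224).half().cuda()                      # GPU input

# model(x)  # raises the RuntimeError seen above

model = model.to(x.device)  # move the model (or the input) so both sides match
out = model(x)              # works: both are torch.cuda.HalfTensor now
print(out.shape)
```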
Sorry, I mean it works without the ClipCrop branch
Encountered another error while processing a folder of images, with CLIP crop only:
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 159, in preprocess_work
im_data = clipseg.get_center(img)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\clipcrop.py", line 122, in get_center
res = cv2.matchTemplate(numpy.array(image), numpy.array(out), cv2.TM_SQDIFF)
cv2.error: OpenCV(4.6.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\templmatch.cpp:1175: error: (-215:Assertion failed) _img.size().height <= _templ.size().height && _img.size().width <= _templ.size().width in function 'cv::matchTemplate'
Now that's a good one. So, cv2.matchTemplate is the function I use to take the "isolated subject" image and find its center within the "main" image. This is, for some reason, saying that the isolated image is larger than the source image, which shouldn't actually be possible.
Are you able to find the offending image and possibly provide it so I can test with it? I think I just need to perform that size check before running matchTemplate, or wrap it in a try/except, but I'm still curious as to why it's happening.
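Something like this is what I mean by the size check, as a rough sketch (assuming PIL images come in; the fallback to the image center is just a placeholder, and the real get_center in clipcrop.py does more than this):

```python
import cv2
import numpy as np

def find_subject_center(image, template):
    """Locate `template` (the isolated subject) inside `image` and return its center."""
    img = np.array(image)
    tmpl = np.array(template)
    ih, iw = img.shape[:2]
    th, tw = tmpl.shape[:2]
    if th > ih or tw > iw:
        # Shouldn't happen, but guards against the assertion seen in the traceback.
        return iw // 2, ih // 2
    res = cv2.matchTemplate(img, tmpl, cv2.TM_SQDIFF)
    _, _, min_loc, _ = cv2.minMaxLoc(res)  # TM_SQDIFF: best match is at the minimum
    x, y = min_loc
    return x + tw // 2, y + th // 2
```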
I'll see if I can censor the person in it as it's a photo of a family member.
Just take Nic Cage's face and paste it over the top. :D
The training stops incorrectly. I set the max training steps to 5000 and save a ckpt every 500 steps. Strangely, the training stops at step 500 and breaks. It prints:
Saving checkpoint at step 500. Successfully trained model for a total of 500 steps, converting to ckpt.
100%|██████████| 61/61 [00:06<00:00, 9.35it/s]
Caught exception.
Allocated: 15.4GB Reserved: 19.5GB
Exception training db: [Errno 22] Invalid argument: 'D:\stable-diffusion\models\dreambooth\xxx\logging\_500.png'
Traceback (most recent call last):
File "D:\stable-diffusion\modules\dreambooth\dreambooth.py", line 557, in train
image.save(last_saved_image)
File "D:\stable-diffusion\python\lib\site-packages\PIL\Image.py", line 2317, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'D:\stable-diffusion\models\dreambooth\xxx\logging\_500.png'
CLEANUP: Allocated: 15.4GB Reserved: 19.5GB
Cleanup Complete. Allocated: 15.0GB Reserved: 15.7GB
Steps: 10%|█ | 500/5000 [05:51<52:44, 1.42it/s]
Training completed, reloading SD Model.
Allocated: 0.0GB Reserved: 0.0GB