stable-diffusion-webui
CLIP crop (BLIP Crop, YOLO Crop?)
So, this started out as a mishmash of another library called CLIPCrop, and then quickly devolved into my own hellapalooza of borrowed code.
Basically, what it does is first interrogate the image using BLIP or whatever we have set up, but sort of dumbed down to get a basic description of the image.
Then, it passes it through the YOLOv5 object detection model, which breaks the image up into various detected components.
Finally, we feed it the prompt from CLIP, determine the most likely "main" subject of the photo, get a bounding box for that subject, and then expand it to match the dimensions we're scaling to. We then crop the image to those dimensions, and then downscale as necessary to reach the final result.
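Roughly, the flow boils down to something like the sketch below. This is not the code in this PR: it stands in a stock yolov5s checkpoint via torch.hub and the openai `clip` package for the scoring, and the function name and box-expansion math are just illustrative.

```python
import torch
import clip  # the openai CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s")            # object detection
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)


def smart_crop(image: Image.Image, prompt: str, width: int, height: int) -> Image.Image:
    """Crop `image` around the detected subject that best matches `prompt`.

    `prompt` is either the caption from the interrogator or a user-supplied string
    (the "target a specific image element" option mentioned further down).
    """
    # 1. Break the image up into detected components with YOLOv5.
    detections = yolo(image).xyxy[0]                  # rows of [x1, y1, x2, y2, conf, cls]
    if len(detections) == 0:
        return image.resize((width, height))          # nothing detected: plain resize

    # 2. Score each detected crop against the prompt with CLIP to pick the "main" subject.
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        best_box, best_score = None, float("-inf")
        for x1, y1, x2, y2, conf, cls in detections.tolist():
            crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
            img_feat = clip_model.encode_image(clip_preprocess(crop).unsqueeze(0).to(device))
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            score = (img_feat @ text_feat.T).item()
            if score > best_score:
                best_box, best_score = (x1, y1, x2, y2), score

    # 3. Expand the winning box to the target aspect ratio, roughly clamped to the image.
    x1, y1, x2, y2 = best_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    bw, bh = x2 - x1, y2 - y1
    target_ratio = width / height
    if bw / bh < target_ratio:
        bw = bh * target_ratio                        # box too narrow: widen
    else:
        bh = bw / target_ratio                        # box too wide: heighten
    left = max(0, min(cx - bw / 2, image.width - bw))
    top = max(0, min(cy - bh / 2, image.height - bh))
    cropped = image.crop((int(left), int(top), int(left + bw), int(top + bh)))

    # 4. Downscale to the final training resolution.
    return cropped.resize((width, height), Image.LANCZOS)
```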
Compared to "no crop influence", "auto focal point crop", and my previous attempt at a method, I think the results speak for themselves. Will this work perfectly for every use-case? No. Is it a lot smarter than before? I think so.
OH, also, you don't have to interrogate the image with CLIP. I would have to add it to the UI, but you can also pass a specific prompt to the pre-processor to target a specific image element.
This is a replacement for #3670
You're right, insane.
I'm surprised you're not exposing the prompt parameter to allow textually guided clipping.
All in good time. Fixed some issues where it couldn't detect subjects and target sizes were less than ideal.
Need to copy how the "focal crop" method hides/shows elements in the UI; then I can add the prompt as an option.
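For reference, that show/hide pattern boils down to something like this minimal Gradio sketch; the component names and labels are made up for illustration, not what the preprocess tab actually uses:

```python
import gradio as gr

with gr.Blocks() as demo:
    use_clip_crop = gr.Checkbox(label="Use CLIP crop", value=False)
    crop_prompt = gr.Textbox(label="Crop prompt (optional)", visible=False)

    # Toggle the textbox's visibility whenever the checkbox changes.
    use_clip_crop.change(
        fn=lambda enabled: gr.update(visible=enabled),
        inputs=[use_clip_crop],
        outputs=[crop_prompt],
    )

if __name__ == "__main__":
    demo.launch()
```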
@AUTOMATIC1111 - I think this one should be ready to go if you wanna give it a once-over.
This is great!
With smart crop and captioning selected, I get:
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 49, in preprocess_work
clipseg = CropClip()
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\clipcrop.py", line 64, in __init__
self.model = torch.hub.load('ultralytics/yolov5', 'custom', model_path[0])
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\hub.py", line 540, in load
model = _load_local(repo_or_dir, model, *args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\hub.py", line 569, in _load_local
model = entry(*args, **kwargs)
File "C:\Users\esfle/.cache\torch\hub\ultralytics_yolov5_master\hubconf.py", line 83, in custom
return _create(path, autoshape=autoshape, verbose=_verbose, device=device)
File "C:\Users\esfle/.cache\torch\hub\ultralytics_yolov5_master\hubconf.py", line 33, in _create
from models.common import AutoShape, DetectMultiBackend
ModuleNotFoundError: No module named 'models.common'
After checking it out again it now works, but if launched with captioning it gives this error:
Fusing layers...
YOLOv5m6 summary: 378 layers, 35704908 parameters, 0 gradients
Adding AutoShape...
0%| | 0/24 [00:05<?, ?it/s]
Error completing request
Arguments: ('C:\\Users\\esfle\\OneDrive\\תמונות\\testShowcase', 'C:\\Users\\esfle\\OneDrive\\תמונות\\testShowcase\\outputtest', 512, 512, 'ignore', False, False, True, False, 0.5, 0.2, False, 0.9, 0.15, 0.5, False, True) {}
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 218, in preprocess_work
save_pic(img, index, existing_caption=existing_caption)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 100, in save_pic
save_pic_with_caption(image, index, existing_caption=existing_caption)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption
caption += shared.interrogator.generate_caption(image)
File "F:\GitHub\stable-diffusion-webui-auto\modules\interrogate.py", line 128, in generate_caption
caption = self.blip_model.generate(gpu_image, sample=False, num_beams=shared.opts.interrogate_clip_num_beams, min_length=shared.opts.interrogate_clip_min_length, max_length=shared.opts.interrogate_clip_max_length)
File "F:\GitHub\stable-diffusion-webui-auto\repositories\BLIP\models\blip.py", line 129, in generate
image_embeds = self.visual_encoder(image)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\repositories\BLIP\models\vit.py", line 182, in forward
x = self.patch_embed(x)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\timm\models\layers\patch_embed.py", line 35, in forward
x = self.proj(x)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "F:\GitHub\stable-diffusion-webui-auto\venv\lib\site-packages\torch\nn\modules\conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault.
If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault.
If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
The better question would be - does this fail when NOT using the ClipCrop branch?
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 70, in save_pic_with_caption caption += shared.interrogator.generate_caption(image)
So, I don't think this one is technically my fault. If I look at the code for the two different methods in interrogate.py, I can see that "interrogate" does some shuffling of things in VRAM first, while "generate_clip" (the failing call) does not. I suspect something needs to be fixed there so that the input and weight are the same value type.
The better question would be - does this fail when NOT using the ClipCrop branch?
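For what it's worth, here is a minimal standalone illustration of that mismatch, using a toy conv layer as a stand-in for BLIP's patch embedding (nothing here is the actual webui code):

```python
import torch

# Toy stand-in for BLIP's patch-embedding conv: weights stay on the CPU in half
# precision while the input has already been moved to the GPU, reproducing
# "Input type (torch.cuda.HalfTensor) and weight type (torch.HalfTensor) should be the same".
model = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16).half()  # CPU weights
x = torch.randn(1, 3, 224, 224).half().cuda()                      # GPU input

# model(x)  # raises the RuntimeError seen above

model = model.to(x.device)  # move the model (or the input) so both sides match
out = model(x)              # works: both are torch.cuda.HalfTensor now
print(out.shape)
```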
Sorry, I mean it works without the ClipCrop branch
Encountered another error while processing a folder of images, with CLIP crop only:
Traceback (most recent call last):
File "F:\GitHub\stable-diffusion-webui-auto\modules\ui.py", line 221, in f
res = list(func(*args, **kwargs))
File "F:\GitHub\stable-diffusion-webui-auto\webui.py", line 63, in f
res = func(*args, **kwargs)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\ui.py", line 19, in preprocess
modules.textual_inversion.preprocess.preprocess(*args)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 30, in preprocess
preprocess_work(process_src, process_dst, process_width, process_height, preprocess_txt_action, process_flip,
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\preprocess.py", line 159, in preprocess_work
im_data = clipseg.get_center(img)
File "F:\GitHub\stable-diffusion-webui-auto\modules\textual_inversion\clipcrop.py", line 122, in get_center
res = cv2.matchTemplate(numpy.array(image), numpy.array(out), cv2.TM_SQDIFF)
cv2.error: OpenCV(4.6.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\templmatch.cpp:1175: error: (-215:Assertion failed) _img.size().height <= _templ.size().height && _img.size().width <= _templ.size().width in function 'cv::matchTemplate'
Now that's a good one. So, cv2.matchTemplate is the function I use to take the "isolated subject" image and find its center within the "main" image. This is, for some reason, saying that the isolated image is larger than the source image, which shouldn't actually be possible.
Are you able to find the offending image and possibly provide it so I can test with it? I think I just need to perform that size check before running matchTemplate, or wrap it in a try/except, but I'm still curious as to why it's happening.
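Something like this is what I mean by the size check, as a rough sketch (assuming PIL images come in; the fallback to the image center is just a placeholder, and the real get_center in clipcrop.py does more than this):

```python
import cv2
import numpy as np

def find_subject_center(image, template):
    """Locate `template` (the isolated subject) inside `image` and return its center."""
    img = np.array(image)
    tmpl = np.array(template)
    ih, iw = img.shape[:2]
    th, tw = tmpl.shape[:2]
    if th > ih or tw > iw:
        # Shouldn't happen, but guards against the assertion seen in the traceback.
        return iw // 2, ih // 2
    res = cv2.matchTemplate(img, tmpl, cv2.TM_SQDIFF)
    _, _, min_loc, _ = cv2.minMaxLoc(res)  # TM_SQDIFF: best match is at the minimum
    x, y = min_loc
    return x + tw // 2, y + th // 2
```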
I'll see if I can censor the person in it as it's a photo of a family member.
Just take Nic Cage's face and paste it over the top. :D
The training stops incorrectly. I set the max training steps to 5000 and save a ckpt every 500 steps. Strangely, the training stops at step 500 and breaks. It prints:
Saving checkpoint at step 500. Successfully trained model for a total of 500 steps, converting to ckpt.
100%|██████████| 61/61 [00:06<00:00, 9.35it/s]
Caught exception.
Allocated: 15.4GB Reserved: 19.5GB
Exception training db: [Errno 22] Invalid argument: 'D:\stable-diffusion\models\dreambooth\xxx\logging\_500.png'
Traceback (most recent call last):
File "D:\stable-diffusion\modules\dreambooth\dreambooth.py", line 557, in train
image.save(last_saved_image)
File "D:\stable-diffusion\python\lib\site-packages\PIL\Image.py", line 2317, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'D:\stable-diffusion\models\dreambooth\xxx\logging\_500.png'
CLEANUP: Allocated: 15.4GB Reserved: 19.5GB
Cleanup Complete. Allocated: 15.0GB Reserved: 15.7GB
Steps: 10%|█ | 500/5000 [05:51<52:44, 1.42it/s]
Training completed, reloading SD Model.
Allocated: 0.0GB Reserved: 0.0GB