stable-diffusion-webui
[Feature Request]: add blip2 model to "Preprocess images".
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What would your feature do?
It would use BLIP2 models to generate text descriptions (captions) of images.
Proposed workflow
- Go to Train
- Press Preprocess images
- Currently you can choose "Use BLIP for caption".
- Add a "Use BLIP2 for caption" option here.
Additional information
No response
Great idea!!!
Use one of these extensions. If you need to unload the previous model, you can use supermerger's unload model button. https://github.com/p1atdev/stable-diffusion-webui-blip2-captioner https://github.com/Tps-F/sd-webui-blip2
I was working on this, but I'm having trouble understanding some of the code structure. I'll figure it out and try to add this feature, but if possible, could you give me a brief overview of the overall structure?
@ArjunDevSingla I hope this helps 👇
Here's a brief summary of @p1atdev stable-diffusion-webui-blip2-captioner/blip2.py:
This Python code defines a `BLIP2` class that generates captions for images using a pre-trained model. It uses the PyTorch library and relies on a separate module called `lavis.models`.
- Import the required libraries: `torch`, `typing`, `PIL.Image`, and `lavis.models`.
- Define the `BLIP2` class with an `__init__` method that takes a `model_type` argument:
  a. Determine the device (GPU or CPU) for running the model based on the availability of CUDA.
  b. Load the pre-trained model and preprocessors using the `load_model_and_preprocess` function from `lavis.models`.
- Define a `generate_caption` method for the `BLIP2` class with several parameters, including the input image and options for controlling the caption generation process:
  a. Preprocess the input image using the visual preprocessor and move it to the appropriate device (GPU or CPU).
  b. Generate captions using the pre-trained model and the given parameters for beam search, nucleus sampling, maximum and minimum caption length, and repetition penalty.
  c. Return the generated captions.
- Define an `unload` method for the `BLIP2` class that frees memory by deleting the model and preprocessors and clearing the GPU cache.
The code provides an interface for loading a pre-trained model, generating captions for images, and then unloading the model to free up resources.
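The class structure described above can be sketched as follows. This is a hand-written outline, not the extension's actual code: the `lavis`/`torch` calls are replaced with labeled stand-ins so the shape of `__init__`/`generate_caption`/`unload` is visible without a GPU.

```python
# Structural sketch of the BLIP2 wrapper (stand-ins, not the real extension code).
class BLIP2:
    def __init__(self, model_type: str):
        # Real class: device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = "cpu"  # stand-in
        self.model_type = model_type
        # Real class: model, vis_processors, _ = load_model_and_preprocess(...)
        self.model = f"<model:{model_type}>"             # stand-in for the model
        self.vis_processors = {"eval": lambda img: img}  # stand-in preprocessor

    def generate_caption(self, image, use_nucleus_sampling=False, num_beams=3,
                         max_length=30, min_length=10, repetition_penalty=1.0):
        # a. Preprocess the image (the real class also moves it to self.device).
        processed = self.vis_processors["eval"](image)
        # b. Real class: self.model.generate({"image": ...}, num_beams=..., ...)
        return [f"caption for {processed} ({self.model_type})"]

    def unload(self):
        # c. Drop references so memory can be reclaimed; the real class also
        # clears the GPU cache afterwards.
        del self.model
        del self.vis_processors
```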
Here is a brief summary of @p1atdev stable-diffusion-webui-blip2-captioner/scripts/main.py
This script is a Python program for generating captions for images using `BLIP2`. It provides both single-image and batch captioning. The program uses the Gradio library to create a user interface for easy interaction.
- Import necessary libraries and modules, such as `os`, `pathlib`, `torch`, `gradio`, and `PIL`.
- Set `ImageFile.LOAD_TRUNCATED_IMAGES` to `True` to allow loading of truncated images.
- Import `script_callbacks` from the `modules` package.
- Import the `BLIP2` class from the `blip2` module.
- Create an empty dictionary called `captioners` to store loaded models.
- Define a list called `model_list` containing the names of available models ("coco" and "pretrain").
- Define a list called `sampling_methods` containing the names of available sampling methods ("Nucleus" and "Top-K").
- Define a function `model_check` that checks whether a model is already loaded, and loads it if it is not in the `captioners` dictionary.
- Define a function `unload_models` that unloads all the models in the `captioners` dictionary and clears the GPU cache.
- Define a function `generate_caption` that takes an image and various caption generation parameters, and returns a generated caption for the image.
- Define a function `generate_caption_for_single_image` that takes an image and caption generation parameters, and returns a caption for the image.
- Define a function `create_caption_file` that takes a caption and an output file path, and writes the caption to a file at the specified path.
- Define a function `batch_captioning` that takes input and output directories, a caption file extension, and caption generation parameters, generates captions for all the images in the input directory, and saves them to the output directory.
- Define a function `on_ui_tabs` that creates the Gradio user interface with two tabs: "Single" for single-image captioning and "Batch" for batch captioning. The interface includes input elements such as an image upload, text boxes, dropdowns, sliders, and buttons.
- Register the `on_ui_tabs` function with the `script_callbacks` module using the `on_ui_tabs` method.