
Fine-tuning code for CLIP models

Fine-tuning code for CLIP! 🤩

  • Originally made for Long-CLIP, see zer0int/Long-CLIP for fine-tuning Long-CLIP
  • This repo is for fine-tuning the original OpenAI/CLIP models!

Optimized for: ViT-L/14 (the text encoder of SD / SDXL) + I have 1 NVIDIA GPU with 24 GB VRAM available... 😅 But you can train any OpenAI/CLIP model with this (just remember to tweak batch_size etc. for smaller models, if applicable!).

You won't win benchmarks by throwing small batch_sizes at a big model such as ViT-L/14; but used as the text encoder for e.g. Stable Diffusion SDXL, this fine-tuned CLIP will win some hearts! 💙🤖

  • Uses AMP (automatic mixed precision) + AdaBelief optimizer (optional: fall back to AdamW) + OneCycleLR scheduler with warmup
  • Gradually unfreeze CLIP (optional) or train the whole model (default) + set the learning rate for individual parameters (optional; see the sketch after this list)
  • Debug print when exploding or vanishing gradients occur + Many fancy logs and plots with live training updates
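
To make the optimizer/scheduler bullets concrete, here is a minimal configuration sketch (AdaBelief with an AdamW fallback + OneCycleLR with warmup, and per-parameter-group learning rates). The group split and all numbers here are illustrative assumptions, not the script's actual settings:

```python
# Illustrative sketch only: AdaBelief (or AdamW fallback) + OneCycleLR with warmup,
# using separate learning rates per parameter group. All values are assumptions.
import torch
import clip

model, _ = clip.load("ViT-L/14", device="cpu", jit=False)

param_groups = [
    {"params": model.visual.parameters(), "lr": 1e-6},       # vision tower
    {"params": model.transformer.parameters(), "lr": 3e-6},  # text tower
]

try:
    from adabelief_pytorch import AdaBelief  # pip install adabelief-pytorch
    optimizer = AdaBelief(param_groups, lr=1e-6, eps=1e-16, rectify=True)
except ImportError:
    optimizer = torch.optim.AdamW(param_groups, lr=1e-6, weight_decay=0.01)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-6, 3e-6],   # one max_lr per parameter group
    total_steps=10_000,    # = epochs * steps_per_epoch
    pct_start=0.1,         # ~10% of steps spent on warmup
)
```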

How to use:

0. Install the dependencies from requirements-finetune.txt.

1. ft-A-clip-interrogator-csv-to-json-labels.py

  • Converts a "desc.csv" from CLIP Interrogator into a dataset labels .json file.
  • Example: ft-X-example-my-dataset-labels.json shows the format my fine-tuning script expects. If your labels are in a different format - e.g. single text files next to the images - explain that to GPT-4, Claude 3, or any other AI assistant, add "and I need to convert them to be labels in a single .json file that should look like so:", and copy-paste the content of ft-X-example-my-dataset-labels.json into the prompt as a one-shot example.
  • If you load your dataset as dataset1 = ImageTextDataset("path/to/image/folder", "path/to/my-text-labels.json", transform=preprocess) and the images inside the .json are listed as "subpath/to/0001.jpg", then the dataloader will look for each image in "path/to/image/folder/subpath/to/0001.jpg" (see the sketch below).
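
A minimal sketch of that layout and path resolution, assuming the labels .json maps relative image paths to one or more captions (check ft-X-example-my-dataset-labels.json for the exact schema); the paths and captions here are made up:

```python
# Hypothetical labels .json layout; the schema is an assumption based on the
# example file, and the paths/captions are placeholders.
import json
import os

labels = {
    "subpath/to/0001.jpg": ["a red fox standing in deep snow"],
    "subpath/to/0002.jpg": ["macro shot of a dew-covered spider web"],
}
with open("my-text-labels.json", "w") as f:
    json.dump(labels, f, indent=2)

# The dataloader joins the image folder passed to ImageTextDataset with each
# relative key from the .json:
image_folder = "path/to/image/folder"
for rel_path in labels:
    print(os.path.join(image_folder, rel_path))
    # -> path/to/image/folder/subpath/to/0001.jpg, etc.
```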

2. ft-A-augment-data-color-jitter.py

  • Data augmentation: If your dataset is ~1000 images, consider augmenting the images by flipping them horizontally etc.
  • The example script creates a copy of your images with color jitter (sketched after this list), which helps prevent CLIP from overfitting on specific colors.
  • Use augmented images with .json labels and randomly select from multiple labels for a given image. See code in (3) for details.
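
Below is a minimal sketch of what such a color-jitter pass can look like with torchvision; the jitter strengths, paths, and output-folder naming are assumptions, not necessarily what ft-A-augment-data-color-jitter.py does:

```python
# Minimal color-jitter augmentation sketch (not the repo's exact script).
# Jitter strengths, folder paths, and naming convention are placeholder assumptions.
import os
from PIL import Image
from torchvision import transforms

jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05)

src_dir = "path/to/image/folder"            # hypothetical paths
dst_dir = "path/to/image/folder-jittered"
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.lower().endswith((".jpg", ".jpeg", ".png")):
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
        jitter(img).save(os.path.join(dst_dir, name))
```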

3. ft-B-train-OpenAI-CLIP-ViT-L-14.py

  • Fine-tune CLIP. Insert the dataset .json and the path to your images as per the previous steps. See the # comments in the code for details.
  • 10,000 text-image pairs can achieve good fine-tuning results within 1-2 hours (RTX 4090). A simplified sketch of the training loop follows below.
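
For orientation, here is a stripped-down sketch of the kind of AMP + contrastive-loss training step involved. It is not the actual script: the data loading, batch size, and learning rate are placeholder assumptions, and the real script adds the scheduler, optional gradual unfreezing, logging, and gradient checks:

```python
# Simplified AMP training loop with the standard CLIP contrastive loss.
# Illustration only, not ft-B-train-OpenAI-CLIP-ViT-L-14.py itself;
# paths, batch size, and learning rate are placeholder assumptions.
import json
import os
import torch
import clip
from PIL import Image
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
model, preprocess = clip.load("ViT-L/14", device=device, jit=False)
model = model.float()  # fp32 master weights; autocast handles the mixed precision

image_folder = "path/to/image/folder"
with open("my-text-labels.json") as f:
    labels = json.load(f)  # assumed: {"subpath/to/0001.jpg": ["caption", ...], ...}
items = [(os.path.join(image_folder, k), v[0]) for k, v in labels.items()]

def batches(batch_size=32):
    for i in range(0, len(items), batch_size):
        chunk = items[i:i + batch_size]
        imgs = torch.stack([preprocess(Image.open(p).convert("RGB")) for p, _ in chunk])
        txts = clip.tokenize([c for _, c in chunk], truncate=True)
        yield imgs, txts

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.01)
scaler = GradScaler()
ce = torch.nn.CrossEntropyLoss()

for images, texts in batches():
    images, texts = images.to(device), texts.to(device)
    optimizer.zero_grad()
    with autocast():
        logits_per_image, logits_per_text = model(images, texts)
        targets = torch.arange(images.size(0), device=device)
        # symmetric cross-entropy: each image matches its own caption and vice versa
        loss = (ce(logits_per_image, targets) + ce(logits_per_text, targets)) / 2
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```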

4. ft-C-convert-for-SDXL-comfyUI-OpenAI-CLIP.py

  • Converts the model .pt saved with torch.save into a state_dict that you can then just plug into SDXL as the text encoder (see the sketch below).
  • Easy as Pi with ComfyUI, see comfyanonymous/ComfyUI for details!
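
In essence, the conversion boils down to something like the sketch below; file names are placeholders, and the repo script should be used for the exact output ComfyUI expects:

```python
# Rough sketch of the conversion idea: the training script saves the whole model
# with torch.save(model, ...), while a state_dict is what gets plugged into SDXL.
# File names are placeholders; use ft-C-convert-for-SDXL-comfyUI-OpenAI-CLIP.py.
import torch

model = torch.load("my-finetuned-clip.pt", map_location="cpu")  # full pickled model
torch.save(model.state_dict(), "my-finetuned-clip-state-dict.pt")
```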

[Image: ComfyUI instructions]

5. Example benefit of fine-tuning CLIP: Crazy "DeepDream of CLIP's own Neurons" dataset. Don't ask. ;-)

  • Same random seed etc., just swapping out the original CLIP text encoder for my fine-tune. CFG scale 14 = high CLIP influence / guidance.
  • Please note: The U-Net of SDXL was also trained on the same dataset, but with CLIP frozen (i.e. independently of the CLIP fine-tune).
  • For fine-tuning the SDXL U-Net diffusion model to complement CLIP, please refer to kohya-ss/sd-scripts.

[Image: why-you-should]