sd-scripts
Support for multiple captions in one file
This is a simple little change which adds support for using multiple captions in a single caption file, one per line. During training, each time the caption is needed, a random line is sampled.
I'm using this with a line of WD14 tags, a LLaVa natural language caption, and a LLaVa natural language caption which was prompted by including the WD14 tags. This seems to have helped quite a bit with training robustness.
It would be cleaner to assign an array of captions to image_info.caption rather than splitting the lines on each pass, but I wasn't sure if there was a dependency elsewhere on caption being a string, so this is just my first little quick and dirty pass.
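To illustrate the idea, a caption file for one image would contain the three variants described above, one per line, and sampling reduces to picking a random non-empty line. A minimal sketch of the approach (not the exact patch):
import random

def sample_caption(caption_text: str) -> str:
    # keep only non-empty lines, then pick one uniformly at random
    lines = [line.strip() for line in caption_text.splitlines() if line.strip()]
    return random.choice(lines)

# example caption file contents: one caption per line
example = """1girl, solo, long hair, outdoors
a photo of a woman with long hair standing outside
a woman with long hair outside, tags: 1girl, solo, outdoors"""
print(sample_caption(example))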
This would be nice. I could add this feature to my batch captioners so people can combine the output of several different captioners.
Thank you for this PR! I've implemented wildcard notation (like {aaa|bbb|ccc}) in dev branch. Is it fine to use it instead of this?
A new line is much easier: turning captions into that notation requires a lot of work, whereas appending a new caption on a new line is simple.
In the extension of the WebUI, using {A|B} is called Dynamic Prompts, while selecting randomly from different lines in a text file is called Wildcards. I think both can coexist.
Additionally, I believe that combining a wildcard with "an image of XXX" should help in directly leveraging trigger words for effects.
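For reference, resolving that notation amounts to a small regex pass; a rough sketch (not the actual dev-branch code):
import random
import re

def resolve_wildcards(caption: str) -> str:
    # replace each {aaa|bbb|ccc} group with one randomly chosen alternative
    return re.sub(r"\{([^{}]+)\}", lambda m: random.choice(m.group(1).split("|")), caption)

print(resolve_wildcards("an image of {a cat|a dog|a bird}"))  # e.g. "an image of a dog"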
I agree that a new line is much easier. However, multi-line captions can be replaced by wildcards, while wildcards cannot be replaced by multi-line captions, so I'm not sure this is worth implementing...
In addition, this is a breaking change, so some option is needed to enable this feature.
Yes, that wildcard notation gives more features, but it is really hard to produce. Assume you have 1000 images: you can run multiple different captioners and simply append each new caption on a new line.
Thank you for this PR! I've implemented wildcard notation (like {aaa|bbb|ccc}) in dev branch. Is it fine to use it instead of this?
I think they should be able to coexist; multiple captions, one per line, with wildcards within a line should be compatible with each other and should both serve to improve training robustness.
In addition, this is a breaking change, so some option is needed to enable this feature.
I don't think it's actually a breaking change, because the current code only considers the first line of the caption already - subsequent lines are ignored. This essentially just allows for additional lines to be considered. Am I missing something there?
If a user wishes to utilize multiple captions, derived from raw data, a tagger, or a Vision-Language Model (VLM), the script could handle this through an alternative format or file. This approach would address issues related to multiline captions.
Below are my suggestions and code modifications. I have altered the caption attribute in the ImageInfo class to a property. The ImageInfo class will now output the caption by managing a JSON file.
Changes:
- self.caption becomes self._caption
- self.caption is now a property
- ImageInfo loads a JSON file with the same file name as the image
- The JSON contains "captions", "p" (probabilities), and "sampling"
- If the script accesses ImageInfo.caption, ImageInfo returns a sampled caption
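For example, a sidecar file image001.json next to image001.png might look like this (values are illustrative):
{
    "captions": ["1girl, solo, smile", "a photo of a smiling woman outdoors"],
    "p": [0.7, 0.3],
    "sampling": "weight"
}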
import json
import os
from typing import List, Optional, Tuple

import numpy as np
import torch
from PIL import Image


class ImageInfo:
    def __init__(self, image_key: str, num_repeats: int, caption: str, is_reg: bool, absolute_path: str) -> None:
        self.image_key: str = image_key
        self.num_repeats: int = num_repeats
        self._caption: str = caption
        self.is_reg: bool = is_reg
        self.absolute_path: str = absolute_path
        self.image_size: Tuple[int, int] = None
        self.resized_size: Tuple[int, int] = None
        self.bucket_reso: Tuple[int, int] = None
        self.latents: torch.Tensor = None
        self.latents_flipped: torch.Tensor = None
        self.latents_npz: str = None
        self.latents_original_size: Tuple[int, int] = None  # original image size, not latents size
        self.latents_crop_ltrb: Tuple[int, int] = None  # crop left top right bottom in original pixel size, not latents size
        self.cond_img_path: str = None
        self.image: Optional[Image.Image] = None  # optional, original PIL Image
        # SDXL, optional
        self.text_encoder_outputs_npz: Optional[str] = None
        self.text_encoder_outputs1: Optional[torch.Tensor] = None
        self.text_encoder_outputs2: Optional[torch.Tensor] = None
        self.text_encoder_pool2: Optional[torch.Tensor] = None
        # Sampling random caption. Expected JSON schema:
        # {
        #     "captions": List[str],
        #     "p": List[float],
        #     "sampling": one of "weight", "softmax", "uniform"
        # }
        self.json_caption = None
        self.setup_random_caption()

    @staticmethod
    def validate_json_caption_data(image_path: str, json_caption: dict) -> Tuple[List[str], np.ndarray]:
        captions: list = json_caption["captions"]
        assert all(isinstance(caption, str) for caption in captions), f"Type of caption must be string. Please check json file of image: {image_path}."
        p = json_caption.get("p")
        if not isinstance(p, list) or len(p) != len(captions):
            print(f"Image: {image_path}. 'p' and 'captions' must be lists of the same length. Falling back to equal probability for each caption.")
            p = np.ones((len(captions),), dtype=np.float32) / float(len(captions))
        else:
            p = np.asarray(p, dtype=np.float32)
        sampling = json_caption.get("sampling")
        if sampling is None:
            return captions, p
        if not isinstance(sampling, str):
            print(f"Image: {image_path}. sampling method is not a string. got type: {type(sampling)}, value: {sampling}")
            return captions, p
        if sampling.lower() == "weight":
            s_p = p.sum()
            if s_p != 1.0:
                print(f"Image: {image_path}. Sum of probabilities is not equal to 1. Probabilities will be normalized automatically.")
                p = p / s_p
        elif sampling.lower() == "softmax":
            # numerically stable softmax over the given weights
            def softmax(p: np.ndarray) -> np.ndarray:
                p = np.exp(p - p.max())
                return p / p.sum()
            p = softmax(p)
        elif sampling.lower() == "uniform":
            p = np.ones((len(captions),), dtype=np.float32) / float(len(captions))
        else:
            assert False, f"Image: {image_path}. Unsupported sampling method: {sampling}. Only 'weight', 'softmax' and 'uniform' are supported."
        return captions, p

    @property
    def caption(self) -> str:
        if self.json_caption:
            # sample one caption according to the configured probabilities
            _caption = np.random.choice(self._captions, p=self._captions_p)
            # print("[caption] load caption from json", _caption, "from", self._captions, self._captions_p)  # debug
            return str(_caption)
        return self._caption

    def setup_random_caption(self) -> None:
        json_caption_path = os.path.splitext(self.absolute_path)[0] + ".json"
        if not os.path.exists(json_caption_path):
            self.json_caption = False
        else:
            with open(json_caption_path, "r", encoding="utf-8") as json_file:
                self.json_caption = json.load(json_file)
            assert self.json_caption is not None, f"Image: {self.image_key}. Loading json file failed."
            captions, p = ImageInfo.validate_json_caption_data(self.absolute_path, self.json_caption)
            self._captions = captions
            self._captions_p = p
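Usage would then be transparent to the rest of the training code; a hypothetical example (paths and values assumed):
info = ImageInfo("image001", num_repeats=1, caption="fallback caption", is_reg=False, absolute_path="/data/image001.png")
# if /data/image001.json exists, info.caption returns a randomly sampled caption;
# otherwise it falls back to the plain caption string passed in
print(info.caption)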
I don't think it's actually a breaking change, because the current code only considers the first line of the caption already - subsequent lines are ignored. This essentially just allows for additional lines to be considered. Am I missing something there?
I know I am somewhat paranoid, but if an existing caption file has multiple lines (including additional CRLFs), the lines will be unintentionally chosen at random.
The simplest implementation would be to convert the multiple lines to wildcard notation immediately after reading the file. Unfortunately, it could not then coexist with wildcard notation within a line.
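That conversion is essentially a one-liner; a sketch, assuming lines holds the caption lines read from the file:
# "line1", "line2", "line3" -> "{line1|line2|line3}"
caption = "{" + "|".join(line.strip() for line in lines if line.strip()) + "}"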
BootsofLagrangian's implementation is excellent, but it has a large impact and takes a lot of work to test, so I prefer a simpler implementation.
I think you don't need to support everything at once. For example, for multi-line captions, people could opt in with a new parameter like --multi_line_caption :)
Right now, if the caption file contains multiple lines, all but the first are ignored; unless this is documented somewhere, I'd suggest that it's probably surprising behavior, and a random selection is better than none at all. My implementation does ignore blank lines, so there would be no chance of a blank line being randomly sampled.
That said, it wouldn't be hard to add a switch along the lines of --multiple_captions to turn on the multi-caption behavior, so that existing behavior can be fully preserved. I like the idea of support for a more complex format, too, though for my purposes, just random sampling out of a multi-line file is sufficient :)
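For illustration, such a switch would be an ordinary argparse flag; the name --multiple_captions here is hypothetical, not an existing sd-scripts option:
import argparse

parser = argparse.ArgumentParser()
# hypothetical opt-in switch; the existing single-caption behavior stays the default
parser.add_argument(
    "--multiple_captions",
    action="store_true",
    help="treat each non-empty line of a caption file as a separate caption and sample one at random",
)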
I understand the value of multi-line captions :)
I would like to add a command line option for this, so may I implement it myself? I would also like to release this feature at the same time as the wildcard support implemented in the dev branch, so it shouldn't take too long.
The implementation will be simple: it will choose a line randomly.
Absolutely, implement however you'd like -- I'd just be happy to have it in the tool!
I have updated dev branch to support multiline captions. I hope you could test it😀
awesome
so we just add multiple lines and they work right?
Please enable it with --enable_wildcard option or enable_wildcard = true in .toml.
https://github.com/kohya-ss/sd-scripts/blob/dev/docs/config_README-en.md#multi-line-captions
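For .toml users, a minimal snippet might look like this (the exact section where enable_wildcard belongs is my assumption; see the linked doc for the authoritative layout):
[general]
enable_wildcard = true  # opt in to wildcard notation and multi-line captions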
--enable_wildcard, got it. I've never used toml so far :D
I tried --enable_wildcard out for multiline captions. Seems to work when I tried adding a second caption to most of my face images.
And it works well for training on faces, as adding additional captions to each face pic makes it generalize the face better, making the learned face show up more often, and more accurately, when inferencing new images with captions other than the ones in the training set.
The use of multiple captions also seems to make the training last for longer before the high-contrast appearance of overtraining sets in, although I'm not certain of this yet.
One thing is that I can't really see that it's enabled, other than the call to --enable_wildcard didn't get rejected. Maybe some debug line output like 'additional captions detected' would be useful if it's worked? It's not too important for this though, cause I'm pretty sure it's working.
~~Edit: ...except just as soon as I posted that I realized that I'm still on main, not the branch. So it wasn't running at all, and the 'improvements' I saw were just confirmation bias. I'll give the actual branch that has this feature a try tomorrow.~~
Edit2: I see now that kohya ported this change independently to their dev branch, rather than using this pull request. I'm using the dev branch, so it was indeed working for me - and my comments about improvements were likely correct, rather than being confirmation bias. Perhaps some debug output to say that multiple captions are detected would be useful after all?
hello again. what is the latest status of this? i want to use this multi-line feature today. what do i need to do? write a different caption on each line and it will read them randomly? @kohya-ss @araleza @cheald
It's available in the main branch. You just pass --enable_wildcard as a command line parameter.
Each line in your caption file then becomes its own caption, and captions will be chosen at random each time that image comes up in the training loop. Blank lines are ignored.
I got very good results from multi-line captions. Previously I thought my model had become overtrained/burned because I didn't have enough images, but adding extra captions has some of the same effect as adding entirely new images.
@araleza thanks so much, time to test it
thanks @araleza ! This could be very useful. Is there an explanation of the enable_wildcard syntax somewhere? (or example) It looks like a very powerful tool to generate caption diversity and avoid overcooking the text encoder.
Please see this file: https://github.com/kohya-ss/sd-scripts/blob/main/docs/config_README-en.md#multi-line-captions