sd-scripts
Support for multiple captions in one file
This is a simple little change which adds support for using multiple captions in a single caption file, one per line. During training, each time the caption is needed, a random line is sampled.
I'm using this with a line of WD14 tags, a LLaVa natural language caption, and a LLaVa natural language caption which was prompted by including the WD14 tags. This seems to have helped quite a bit with training robustness.
It would be cleaner to assign an array of captions to image_info.caption rather than splitting the lines on each pass, but I wasn't sure if there was a dependency elsewhere on caption being a string, so this is just my first little quick and dirty pass.
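To illustrate the idea, a caption file for one image would contain the three variants described above, one per line, and sampling reduces to picking a random non-empty line. A minimal sketch of the approach (not the exact patch):
import random

def sample_caption(caption_text: str) -> str:
    # keep only non-empty lines, then pick one uniformly at random
    lines = [line.strip() for line in caption_text.splitlines() if line.strip()]
    return random.choice(lines)

# example caption file contents: one caption per line
example = """1girl, solo, long hair, outdoors
a photo of a woman with long hair standing outside
a woman with long hair outside, tags: 1girl, solo, outdoors"""
print(sample_caption(example))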
This would be nice. I could add this feature to my batch captioners so people can combine the output of several different captioners.
Thank you for this PR! I've implemented wildcard notation (like {aaa|bbb|ccc}) in dev branch. Is it fine to use it instead of this?
A new line is much easier: turning captions into that notation requires a lot of work, whereas appending a new caption on a new line is simple.
In the extension of the WebUI, using {A|B} is called Dynamic Prompts, while selecting randomly from different lines in a text file is called Wildcards. I think both can coexist.
Additionally, I believe that combining a wildcard with "an image of XXX" should help in directly leveraging trigger words for effects.
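For reference, resolving that notation amounts to a small regex pass; a rough sketch (not the actual dev-branch code):
import random
import re

def resolve_wildcards(caption: str) -> str:
    # replace each {aaa|bbb|ccc} group with one randomly chosen alternative
    return re.sub(r"\{([^{}]+)\}", lambda m: random.choice(m.group(1).split("|")), caption)

print(resolve_wildcards("an image of {a cat|a dog|a bird}"))  # e.g. "an image of a dog"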
I agree that a new line is much easier. However, multi-line captions can be replaced by wildcards, while wildcards cannot be replaced by multi-line captions, so I'm not sure this is worth implementing...
In addition, this is a breaking change, so some option is needed to enable this feature.
Yes, that wildcard notation gives more features, but it is really hard to produce. Assume you have 1000 images: you can run multiple different captioners and simply append each new caption on a new line.
Thank you for this PR! I've implemented wildcard notation (like {aaa|bbb|ccc}) in dev branch. Is it fine to use it instead of this?
I think they should be able to coexist; multiple captions, one per line, with wildcards within a line should be compatible with each other and should both serve to improve training robustness.
In addition, this is a breaking change, so some option is needed to enable this feature.
I don't think it's actually a breaking change, because the current code only considers the first line of the caption already - subsequent lines are ignored. This essentially just allows for additional lines to be considered. Am I missing something there?
If a user wishes to utilize multiple captions, derived from raw data, a tagger, or a Vision-Language Model (VLM), the script could handle this through an alternative format or file. This approach would address issues related to multiline captions.
Below are my suggestions and code modifications. I have altered the caption attribute in the ImageInfo class to a property. The ImageInfo class will now output the caption by managing a JSON file.
Changes:
- self.caption becomes self._caption
- self.caption is now a property
- ImageInfo loads a JSON file with the same file name as the image
- The JSON contains "captions", "p" (probabilities), and "sampling"
- If the script accesses ImageInfo.caption, ImageInfo returns a sampled caption
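For example, a sidecar file image001.json next to image001.png might look like this (values are illustrative):
{
    "captions": ["1girl, solo, smile", "a photo of a smiling woman outdoors"],
    "p": [0.7, 0.3],
    "sampling": "weight"
}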
import json
import os
from typing import List, Optional, Tuple

import numpy as np
import torch
from PIL import Image


class ImageInfo:
    def __init__(self, image_key: str, num_repeats: int, caption: str, is_reg: bool, absolute_path: str) -> None:
        self.image_key: str = image_key
        self.num_repeats: int = num_repeats
        self._caption: str = caption
        self.is_reg: bool = is_reg
        self.absolute_path: str = absolute_path
        self.image_size: Tuple[int, int] = None
        self.resized_size: Tuple[int, int] = None
        self.bucket_reso: Tuple[int, int] = None
        self.latents: torch.Tensor = None
        self.latents_flipped: torch.Tensor = None
        self.latents_npz: str = None
        self.latents_original_size: Tuple[int, int] = None  # original image size, not latents size
        self.latents_crop_ltrb: Tuple[int, int] = None  # crop left top right bottom in original pixel size, not latents size
        self.cond_img_path: str = None
        self.image: Optional[Image.Image] = None  # optional, original PIL Image
        # SDXL, optional
        self.text_encoder_outputs_npz: Optional[str] = None
        self.text_encoder_outputs1: Optional[torch.Tensor] = None
        self.text_encoder_outputs2: Optional[torch.Tensor] = None
        self.text_encoder_pool2: Optional[torch.Tensor] = None
        # Sampling random caption. Expected JSON schema:
        # {
        #     "captions": List[str],
        #     "p": List[float],
        #     "sampling": one of "weight", "softmax", "uniform"
        # }
        self.json_caption = None
        self.setup_random_caption()

    @staticmethod
    def validate_json_caption_data(image_path: str, json_caption: dict) -> Tuple[List[str], np.ndarray]:
        captions: list = json_caption["captions"]
        assert all(isinstance(caption, str) for caption in captions), f"Type of caption must be string. Please check json file of image: {image_path}."
        p = json_caption.get("p")
        if not isinstance(p, list) or len(p) != len(captions):
            print(f"Image: {image_path}. 'p' and 'captions' must be lists of the same length. Falling back to equal probability for each caption.")
            p = np.ones((len(captions),), dtype=np.float32) / float(len(captions))
        else:
            p = np.asarray(p, dtype=np.float32)
        sampling = json_caption.get("sampling")
        if sampling is None:
            return captions, p
        if not isinstance(sampling, str):
            print(f"Image: {image_path}. sampling method is not a string. got type: {type(sampling)}, value: {sampling}")
            return captions, p
        if sampling.lower() == "weight":
            s_p = p.sum()
            if s_p != 1.0:
                print(f"Image: {image_path}. Sum of probabilities is not equal to 1. Probabilities will be normalized automatically.")
                p = p / s_p
        elif sampling.lower() == "softmax":
            # numerically stable softmax over the given weights
            def softmax(p: np.ndarray) -> np.ndarray:
                p = np.exp(p - p.max())
                return p / p.sum()
            p = softmax(p)
        elif sampling.lower() == "uniform":
            p = np.ones((len(captions),), dtype=np.float32) / float(len(captions))
        else:
            assert False, f"Image: {image_path}. Unsupported sampling method: {sampling}. Only 'weight', 'softmax' and 'uniform' are supported."
        return captions, p

    @property
    def caption(self) -> str:
        if self.json_caption:
            # sample one caption according to the configured probabilities
            _caption = np.random.choice(self._captions, p=self._captions_p)
            # print("[caption] load caption from json", _caption, "from", self._captions, self._captions_p)  # debug
            return str(_caption)
        return self._caption

    def setup_random_caption(self) -> None:
        json_caption_path = os.path.splitext(self.absolute_path)[0] + ".json"
        if not os.path.exists(json_caption_path):
            self.json_caption = False
        else:
            with open(json_caption_path, "r", encoding="utf-8") as json_file:
                self.json_caption = json.load(json_file)
            assert self.json_caption is not None, f"Image: {self.image_key}. Loading json file failed."
            captions, p = ImageInfo.validate_json_caption_data(self.absolute_path, self.json_caption)
            self._captions = captions
            self._captions_p = p
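Usage would then be transparent to the rest of the training code; a hypothetical example (paths and values assumed):
info = ImageInfo("image001", num_repeats=1, caption="fallback caption", is_reg=False, absolute_path="/data/image001.png")
# if /data/image001.json exists, info.caption returns a randomly sampled caption;
# otherwise it falls back to the plain caption string passed in
print(info.caption)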
I don't think it's actually a breaking change, because the current code only considers the first line of the caption already - subsequent lines are ignored. This essentially just allows for additional lines to be considered. Am I missing something there?
I know I am somewhat paranoid, but if an existing caption file has multiple lines (including additional CRLFs), the lines will be unintentionally chosen at random.
The simplest implementation would be to convert the multiple lines to wildcard notation immediately after reading the file. Unfortunately, it could not then coexist with wildcard notation within a line.
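That conversion is essentially a one-liner; a sketch, assuming lines holds the caption lines read from the file:
# "line1", "line2", "line3" -> "{line1|line2|line3}"
caption = "{" + "|".join(line.strip() for line in lines if line.strip()) + "}"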
BootsofLagrangian's implementation is excellent, but it has a large impact and takes a lot of work to test, so I prefer a simpler implementation.
I think you don't need to support everything at once. For example, for multi-line captions, people could opt in with a new parameter like --multi_line_caption :)
Right now, if the caption file contains multiple lines, all but the first are ignored; unless this is documented somewhere, I'd suggest that it's probably surprising behavior, and a random selection is better than none at all. My implementation does ignore blank lines, so there would be no chance of a blank line being randomly sampled.
That said, it wouldn't be hard to add a switch along the lines of --multiple_captions to turn on the multi-caption behavior, so that existing behavior can be fully preserved. I like the idea of support for a more complex format, too, though for my purposes, just random sampling out of a multi-line file is sufficient :)
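For illustration, such a switch would be an ordinary argparse flag; the name --multiple_captions here is hypothetical, not an existing sd-scripts option:
import argparse

parser = argparse.ArgumentParser()
# hypothetical opt-in switch; the existing single-caption behavior stays the default
parser.add_argument(
    "--multiple_captions",
    action="store_true",
    help="treat each non-empty line of a caption file as a separate caption and sample one at random",
)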
I understand the value of multi-line captions :)
I would like to add a command line option for this, so may I implement it myself? I would also like to release this feature at the same time as the wildcard support implemented in the dev branch, so it shouldn't take too long.
The implementation will be simple: it will choose a line randomly.
Absolutely, implement however you'd like -- I'd just be happy to have it in the tool!
I have updated dev branch to support multiline captions. I hope you could test it😀
awesome
so we just add multiple lines and they work right?
Please enable it with --enable_wildcard option or enable_wildcard = true in .toml.
https://github.com/kohya-ss/sd-scripts/blob/dev/docs/config_README-en.md#multi-line-captions
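For .toml users, a minimal snippet might look like this (the exact section where enable_wildcard belongs is my assumption; see the linked doc for the authoritative layout):
[general]
enable_wildcard = true  # opt in to wildcard notation and multi-line captions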
--enable_wildcard, got it. I've never used toml so far :D
I tried --enable_wildcard out for multiline captions. Seems to work when I tried adding a second caption to most of my face images.
And it works well for training on faces, as adding additional captions to each face pic makes it generalize the face better, making the learned face show up more often, and more accurately, when inferencing new images with captions other than the ones in the training set.
The use of multiple captions also seems to make the training last for longer before the high-contrast appearance of overtraining sets in, although I'm not certain of this yet.
One thing is that I can't really see that it's enabled, other than the call to --enable_wildcard didn't get rejected. Maybe some debug line output like 'additional captions detected' would be useful if it's worked? It's not too important for this though, cause I'm pretty sure it's working.
~~Edit: ...except just as soon as I posted that I realized that I'm still on main, not the branch. So it wasn't running at all, and the 'improvements' I saw were just confirmation bias. I'll give the actual branch that has this feature a try tomorrow.~~
Edit2: I see now that kohya ported this change independently to their dev branch, rather than using this pull request. I'm using the dev branch, so it was indeed working for me - and my comments about improvements were likely correct, rather than being confirmation bias. Perhaps some debug output to say that multiple captions are detected would be useful after all?
hello again. what is the latest status of this? i want to use this multi-line feature today. what do i need to do? write a different caption on each line and it will read them randomly? @kohya-ss @araleza @cheald
It's available in the main branch. You just pass --enable_wildcard as a command line parameter.
Each line in your caption file then becomes its own caption, and captions will be chosen at random each time that image comes up in the training loop. Blank lines are ignored.
I got very good results from multi-line captions. Previously I thought my model had become overtrained/burned because I didn't have enough images, but adding extra captions has some of the same effect as adding entirely new images.
@araleza thanks so much, time to test it
thanks @araleza ! This could be very useful. Is there an explanation of the enable_wildcard syntax somewhere? (or example) It looks like a very powerful tool to generate caption diversity and avoid overcooking the text encoder.
Please see this file: https://github.com/kohya-ss/sd-scripts/blob/main/docs/config_README-en.md#multi-line-captions