TOML questions - validation loss and DoRA
1. Validation Dataset Setup (sd3 Branch Method):
- To set up a single folder of validation images with the https://github.com/kohya-ss/sd-scripts/pull/1864 updates, is configuring a `[[datasets]]` block with `validation_split = 1.0` the correct approach? (I saw mentions of an older `is_validation = true` flag elsewhere, so wanted to confirm the intended method for this branch/feature.)
- Should the validation dataset use the same `batch_size` as the main training dataset?
- I'm using the sd3 branch because of the validation loss; does it still work fine to train SDXL?
- Do captions have any effect or purpose for images within the validation dataset block when calculating validation loss?
Here's my current dataset.toml structure:
```toml
# --- Training Dataset Definition ---
[[datasets]]
batch_size = 4
resolution = [1024, 1024]
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
bucket_no_upscale = true
[[datasets.subsets]]
image_dir = "path"
caption_extension = ".txt"
num_repeats = 10
shuffle_caption = true
keep_tokens = 1
flip_aug = true
random_crop = false
# --- Validation Dataset Definition ---
[[datasets]]
validation_split = 1.0
batch_size = 4
resolution = [1024, 1024]
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
bucket_no_upscale = true
[[datasets.subsets]]
image_dir = "val_path"
caption_extension = ".txt"
```
2. Correct DoRA Implementation:
- I've seen examples using LyCORIS (`network_module="lycoris.kohya"`, `network_args=["algo=dora", ...]`) but encountered a `KeyError: 'dora'` when trying that approach after installing `lycoris-lora`.
- Is using the standard LoRA module with the `use_dora=True` argument, like below, the correct and currently intended way to enable DoRA within the sd3 branch?
```toml
# From config.toml
network_module = "networks.lora"
network_args = ["use_dora=True"]
# (other network_dim, network_alpha settings...)
```
Thanks for any clarification!
- `validation_split` of 1.0 on a separate dataset may work from what I was theorizing, but I haven't tried it. Please let me know if it works as you expect.
- Validation does not currently use the batch size at all, so everything runs with a batch size of 1. This could be improved by batching the individual image/timestep pairs up to the configured batch size to improve performance.
- Validation loss works across all network trainings (ones that use `train_network.py` or the `*_train_network.py` scripts).
- Validation loss works identically to regular training, except the model isn't learning from those images. So captions work just like they do for training images, and should be as accurate as you'd expect of training captions.
- DoRA isn't supported in the networks from this library, but it should work with LyCORIS, and validation should work with DoRA. I'd ask on the LyCORIS repo if you have a specific question, but I think the docs you want are https://github.com/KohakuBlueleaf/LyCORIS/blob/main/docs/Network-Args.md#weight-decompose. So `network_args = ["dora_wd=True"]` and `network_module = "lycoris.kohya"`, as in the sketch below.
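For reference, here's a minimal sketch of what that LyCORIS-based section of the config might look like; only `network_module = "lycoris.kohya"` and `dora_wd=True` come from the answer above and the linked docs, while the `algo`, `network_dim`, and `network_alpha` values are placeholder assumptions:

```toml
# Sketch of a DoRA setup via LyCORIS; dora_wd and the module name are from the
# linked Network-Args docs, the rest are placeholder values.
network_module = "lycoris.kohya"
network_dim = 16                 # placeholder rank
network_alpha = 8                # placeholder alpha
network_args = [
  "algo=lora",                   # base LyCORIS algorithm (assumed; see the docs for which algos support dora_wd)
  "dora_wd=True",                # enable DoRA-style weight decomposition
]
```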
Hi,
Validation appears to work with `validation_split = 1.0` for a folder; at least the log mentions that it picked up all the validation files, and the run has validation steps. I just wanted to make sure because the validation loss graph has been rather flat despite changing many things in the run, but this may just be a problem with this specific dataset.
Thank you for the other clarifications too.
Yeah, it could be that the training isn't having enough impact on the validation set to cause a big difference. Also, having the validation loss not move much might be ideal, in the sense that the model isn't being adjusted too far away from things outside the training dataset. When it's rising, though, there might be something where the model is moving away from the validation dataset. But as you mention, the numbers are relatively small in this example. I think having the validation dataset be diverse but more closely related to your training dataset might show more movement, though the goals might be different. That's maybe why it could be important to also use a validation split of your training dataset, so the validation includes images from your training distribution and you can see whether the model generalizes in the two separate cases. It might also be worth logging the different validation datasets separately, so more distinction can be made between them.
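If it helps, here's a rough sketch of that split-based setup, assuming `validation_split` accepts a fraction at the dataset level the same way the `1.0` case above does (the `0.1` fraction and the path are just illustrative):

```toml
# Single training dataset with a held-out fraction used for validation
# (0.1 is an arbitrary example value).
[[datasets]]
validation_split = 0.1
batch_size = 4
resolution = [1024, 1024]
enable_bucket = true

[[datasets.subsets]]
image_dir = "path"               # same folder as the training subset above
caption_extension = ".txt"
num_repeats = 10
```

That way the validation loss is also measured on held-out images from the same distribution as the training data, in addition to (or instead of) the fully separate validation folder.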