Flux.2-dev training does not work on my big computer
This is for bugs only
Did you already ask in the discord?
Yes/No
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes/No
Describe the bug
For some reason:
- The training eats up 140 GB of my 192 GB of RAM.
- It fills the entire VRAM of my 5090 (I have two 5090s, but the script won't use the second one. That's not the problem; I only mention it to show I have more than enough compute for the training to work.)
- Even that VRAM does not seem to be enough, since it spills into shared memory too.
- My computer lags so badly that even typing this text is a struggle.
My GPU has been generating the first 1024x1024 pre-training sample image for an hour now, and it's still not done. I don't know if it will ever finish.
I checked the following options (the config excerpt after this list shows where they land in the job config):
- low vram
- float8 quantization (I left the quantization options at their defaults)
- cache latents to disk
- bucket sizes of 1280 and 1536
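For reference, those options correspond to these keys in the job config posted below (excerpt only; values are copied straight from my config):

"model": {
    "quantize": true,
    "qtype": "qfloat8",
    "quantize_te": true,
    "qtype_te": "qfloat8",
    "low_vram": true
},
"datasets": [
    {
        "cache_latents_to_disk": true,
        "resolution": [768, 1024, 1280, 1536, 512]
    }
]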
My dataset contains 141 labeled images.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.80 Driver Version: 581.80 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 WDDM | 00000000:01:00.0 On | N/A |
| 0% 29C P1 111W / 575W | 31916MiB / 32607MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 WDDM | 00000000:03:00.0 Off | N/A |
| 0% 25C P8 12W / 575W | 837MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Given that the power draw stays low even at 100% utilization, I suspect most of that "usage" is actually PCIe traffic from the VRAM overflowing into shared memory.
Here is the job config to reproduce:
{
"job": "extension",
"config": {
"name": "test",
"process": [
{
"type": "diffusion_trainer",
"training_folder": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\output",
"sqlite_db_path": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\aitk_db.db",
"device": "cuda",
"trigger_word": null,
"performance_log_every": 10,
"network": {
"type": "lora",
"linear": 32,
"linear_alpha": 32,
"conv": 16,
"conv_alpha": 16,
"lokr_full_rank": true,
"lokr_factor": -1,
"network_kwargs": {
"ignore_if_contains": []
}
},
"save": {
"dtype": "bf16",
"save_every": 250,
"max_step_saves_to_keep": 4,
"save_format": "diffusers",
"push_to_hub": false
},
"datasets": [
{
"folder_path": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\datasets/test",
"mask_path": null,
"mask_min_value": 0.1,
"default_caption": "",
"caption_ext": "txt",
"caption_dropout_rate": 0.05,
"cache_latents_to_disk": true,
"is_reg": false,
"network_weight": 1,
"resolution": [
768,
1024,
1280,
1536,
512
],
"controls": [],
"shrink_video_to_frames": true,
"num_frames": 1,
"do_i2v": true,
"flip_x": false,
"flip_y": false,
"control_path_1": null,
"control_path_2": null,
"control_path_3": null
}
],
"train": {
"batch_size": 1,
"bypass_guidance_embedding": false,
"steps": 3000,
"gradient_accumulation": 1,
"train_unet": true,
"train_text_encoder": false,
"gradient_checkpointing": true,
"noise_scheduler": "flowmatch",
"optimizer": "adamw8bit",
"timestep_type": "weighted",
"content_or_style": "balanced",
"optimizer_params": {
"weight_decay": 0.0001
},
"unload_text_encoder": false,
"cache_text_embeddings": false,
"lr": 0.0001,
"ema_config": {
"use_ema": false,
"ema_decay": 0.99
},
"skip_first_sample": false,
"force_first_sample": false,
"disable_sampling": false,
"dtype": "bf16",
"diff_output_preservation": false,
"diff_output_preservation_multiplier": 1,
"diff_output_preservation_class": "person",
"switch_boundary_every": 1,
"loss_type": "mse"
},
"model": {
"name_or_path": "black-forest-labs/FLUX.2-dev",
"quantize": true,
"qtype": "qfloat8",
"quantize_te": true,
"qtype_te": "qfloat8",
"arch": "flux2",
"low_vram": true,
"model_kwargs": {
"match_target_res": false
},
"layer_offloading": false,
"layer_offloading_text_encoder_percent": 1,
"layer_offloading_transformer_percent": 1
},
"sample": {
"sampler": "flowmatch",
"sample_every": 250,
"width": 1024,
"height": 1024,
"samples": [
{
"prompt": "A cat."
}
],
"neg": "",
"seed": 42,
"walk_seed": true,
"guidance_scale": 4,
"sample_steps": 25,
"num_frames": 1,
"fps": 1
}
}
]
},
"meta": {
"name": "[name]",
"version": "1.0"
}
}
I was able to get this working by removing all samples (so it just trained) and setting the quantization for both the transformer and the text encoder to 3. I also turned the rank down to 8 (from 32). My images were all 512x512, and it took about 6 hours to do 3k steps, but it definitely learned the style I was trying to teach it. (5090)
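In terms of the job config above, those changes roughly correspond to the excerpt below. The "uint3" strings are only placeholders for whatever string the UI actually writes out for its 3-bit option, and I set linear_alpha to match the rank the same way the original config does:

"network": {
    "type": "lora",
    "linear": 8,
    "linear_alpha": 8
},
"train": {
    "disable_sampling": true
},
"model": {
    "quantize": true,
    "qtype": "uint3",
    "quantize_te": true,
    "qtype_te": "uint3"
}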
Try layer offloading. I offload the text encoder 100% and the transformer 90%. Rank 16; quants: float8 for the text encoder, 6 for the transformer. Not sure how it will affect training quality, because I just started; it's doing 8 sec/it. 5090 + 128 GB RAM. Update: it stops after some steps, so I'm trying to offload 100% of everything and disable sampling. Update 2: nearly perfectly trained a character LoRA in 1500 steps. Best model for training, for me. Impressed.
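In config terms that's roughly the excerpt below, assuming the offloading percent fields take a 0-1 fraction like the defaults in the config above; the transformer qtype string is just a placeholder for whatever the UI saves for its 6-bit option. After the first update I set both offload percentages to 1 and disabled sampling:

"network": {
    "type": "lora",
    "linear": 16,
    "linear_alpha": 16
},
"train": {
    "disable_sampling": true
},
"model": {
    "layer_offloading": true,
    "layer_offloading_text_encoder_percent": 1,
    "layer_offloading_transformer_percent": 0.9,
    "quantize": true,
    "qtype": "uint6",
    "quantize_te": true,
    "qtype_te": "qfloat8"
}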