Flux.2-dev training does not work on my big computer
This is for bugs only
Did you already ask in the discord?
Yes/No
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes/No
Describe the bug
For some reason:
- The training eats up 140 GB of my 192 GB of RAM.
- It fills the entire VRAM of my 5090 (I have two 5090s, but the script won't use the second one. That's not the problem; I only mention it to show I have more than enough compute for the training to work.)
- Even that VRAM does not seem to be enough, since it spills into shared memory too.
- My computer lags so badly that even typing this text is a struggle.
My GPU has been generating the first 1024x1024 pre-training sample image for an hour now, and it's still not done. I don't know if it will ever finish.
I checked the following options (the config excerpt after this list shows where they land in the job config):
- low vram
- float8 quantization (I left the quantization options at their defaults)
- cache latents to disk
- bucket sizes of 1280 and 1536
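For reference, those options correspond to these keys in the job config posted below (excerpt only; values are copied straight from my config):

"model": {
    "quantize": true,
    "qtype": "qfloat8",
    "quantize_te": true,
    "qtype_te": "qfloat8",
    "low_vram": true
},
"datasets": [
    {
        "cache_latents_to_disk": true,
        "resolution": [768, 1024, 1280, 1536, 512]
    }
]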
My dataset contains 141 labeled images.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.80 Driver Version: 581.80 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 WDDM | 00000000:01:00.0 On | N/A |
| 0% 29C P1 111W / 575W | 31916MiB / 32607MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 WDDM | 00000000:03:00.0 Off | N/A |
| 0% 25C P8 12W / 575W | 837MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Given that the power draw stays low even at 100% utilization, I suspect most of that "usage" is actually PCIe traffic from the VRAM overflowing into shared memory.
Here is the job config to reproduce:
{
"job": "extension",
"config": {
"name": "test",
"process": [
{
"type": "diffusion_trainer",
"training_folder": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\output",
"sqlite_db_path": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\aitk_db.db",
"device": "cuda",
"trigger_word": null,
"performance_log_every": 10,
"network": {
"type": "lora",
"linear": 32,
"linear_alpha": 32,
"conv": 16,
"conv_alpha": 16,
"lokr_full_rank": true,
"lokr_factor": -1,
"network_kwargs": {
"ignore_if_contains": []
}
},
"save": {
"dtype": "bf16",
"save_every": 250,
"max_step_saves_to_keep": 4,
"save_format": "diffusers",
"push_to_hub": false
},
"datasets": [
{
"folder_path": "C:\\Users\\asuka\\sc\\ia\\AI-Toolkit-Easy-Install\\AI-Toolkit\\datasets/test",
"mask_path": null,
"mask_min_value": 0.1,
"default_caption": "",
"caption_ext": "txt",
"caption_dropout_rate": 0.05,
"cache_latents_to_disk": true,
"is_reg": false,
"network_weight": 1,
"resolution": [
768,
1024,
1280,
1536,
512
],
"controls": [],
"shrink_video_to_frames": true,
"num_frames": 1,
"do_i2v": true,
"flip_x": false,
"flip_y": false,
"control_path_1": null,
"control_path_2": null,
"control_path_3": null
}
],
"train": {
"batch_size": 1,
"bypass_guidance_embedding": false,
"steps": 3000,
"gradient_accumulation": 1,
"train_unet": true,
"train_text_encoder": false,
"gradient_checkpointing": true,
"noise_scheduler": "flowmatch",
"optimizer": "adamw8bit",
"timestep_type": "weighted",
"content_or_style": "balanced",
"optimizer_params": {
"weight_decay": 0.0001
},
"unload_text_encoder": false,
"cache_text_embeddings": false,
"lr": 0.0001,
"ema_config": {
"use_ema": false,
"ema_decay": 0.99
},
"skip_first_sample": false,
"force_first_sample": false,
"disable_sampling": false,
"dtype": "bf16",
"diff_output_preservation": false,
"diff_output_preservation_multiplier": 1,
"diff_output_preservation_class": "person",
"switch_boundary_every": 1,
"loss_type": "mse"
},
"model": {
"name_or_path": "black-forest-labs/FLUX.2-dev",
"quantize": true,
"qtype": "qfloat8",
"quantize_te": true,
"qtype_te": "qfloat8",
"arch": "flux2",
"low_vram": true,
"model_kwargs": {
"match_target_res": false
},
"layer_offloading": false,
"layer_offloading_text_encoder_percent": 1,
"layer_offloading_transformer_percent": 1
},
"sample": {
"sampler": "flowmatch",
"sample_every": 250,
"width": 1024,
"height": 1024,
"samples": [
{
"prompt": "A cat."
}
],
"neg": "",
"seed": 42,
"walk_seed": true,
"guidance_scale": 4,
"sample_steps": 25,
"num_frames": 1,
"fps": 1
}
}
]
},
"meta": {
"name": "[name]",
"version": "1.0"
}
}
I was able to get this working by removing all samples (so it just trained) and setting the quantization for both the transformer and the text encoder to 3. I also turned the rank down to 8 (from 32). My images were all 512x512, and it took about 6 hours to do 3k steps, but it definitely learned the style I was trying to teach it. (5090)
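In terms of the job config above, those changes roughly correspond to the excerpt below. The "uint3" strings are only placeholders for whatever string the UI actually writes out for its 3-bit option, and I set linear_alpha to match the rank the same way the original config does:

"network": {
    "type": "lora",
    "linear": 8,
    "linear_alpha": 8
},
"train": {
    "disable_sampling": true
},
"model": {
    "quantize": true,
    "qtype": "uint3",
    "quantize_te": true,
    "qtype_te": "uint3"
}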
Try layer offloading. I offload the text encoder 100% and the transformer 90%. Rank 16; quants: float8 for the text encoder, 6 for the transformer. Not sure how it will affect training quality, because I just started; it's doing 8 sec/it. 5090 + 128 GB RAM. Update: it stops after some steps, so I'm trying to offload 100% of everything and disable sampling. Update 2: nearly perfectly trained a character LoRA in 1500 steps. Best model for training, for me. Impressed.
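In config terms that's roughly the excerpt below, assuming the offloading percent fields take a 0-1 fraction like the defaults in the config above; the transformer qtype string is just a placeholder for whatever the UI saves for its 6-bit option. After the first update I set both offload percentages to 1 and disabled sampling:

"network": {
    "type": "lora",
    "linear": 16,
    "linear_alpha": 16
},
"train": {
    "disable_sampling": true
},
"model": {
    "layer_offloading": true,
    "layer_offloading_text_encoder_percent": 1,
    "layer_offloading_transformer_percent": 0.9,
    "quantize": true,
    "qtype": "uint6",
    "quantize_te": true,
    "qtype_te": "qfloat8"
}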