Flux LoRA 4-bit quantized training doesn't start
It crashes right before the actual training steps begin. I'll drop the full log and config below; I haven't found an existing issue like this and would like to know how to fix it.
Log
/train.sh
/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/nvidia/nvjitlink/lib
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-08-11 14:32:29,095 [INFO] (ArgsParser) VAE Model: black-forest-labs/FLUX.1-dev
2024-08-11 14:32:29,095 [INFO] (ArgsParser) Default VAE Cache location:
2024-08-11 14:32:29,095 [INFO] (ArgsParser) Text Cache location: cache
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux.
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28.
2024-08-11 14:32:29,318 [WARNING] (__main__) If using an Ada or Ampere NVIDIA device, --allow_tf32 could add a bit more performance.
2024-08-11 14:32:29,318 [INFO] (__main__) Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-08-11 14:32:30,429 [INFO] (__main__) Loading OpenAI CLIP-L text encoder from black-forest-labs/FLUX.1-dev/text_encoder..
2024-08-11 14:32:30,769 [INFO] (__main__) Loading T5 XXL v1.1 text encoder from black-forest-labs/FLUX.1-dev/text_encoder_2..
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 25420.02it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.86it/s]
2024-08-11 14:32:33,475 [INFO] (__main__) Load VAE: black-forest-labs/FLUX.1-dev
2024-08-11 14:32:33,813 [INFO] (__main__) Moving text encoder to GPU.
2024-08-11 14:32:33,864 [INFO] (__main__) Moving text encoder 2 to GPU.
2024-08-11 14:32:34,309 [INFO] (__main__) Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-08-11 14:32:34,327 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json
2024-08-11 14:32:34,328 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1848.20it/s]
2024-08-11 14:32:34,638 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-08-11 14:32:34,640 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-08-11 14:32:39,646 [WARNING] (DataBackendFactory) Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption-dropout_probability=0.1 as a recommended value.
2024-08-11 14:32:39,646 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-08-11 14:32:39,646 [INFO] (DataBackendFactory) Configuring data backend: test-flux-v1
2024-08-11 14:32:39,647 [INFO] (DataBackendFactory) (id=test-flux-v1) Loading bucket manager.
2024-08-11 14:32:39,648 [INFO] (JsonMetadataBackend) Checking for cache file: /disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1/aspect_ratio_bucket_indices.json
2024-08-11 14:32:39,649 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-08-11 14:32:39,649 [INFO] (DataBackendFactory) (id=test-flux-v1) Refreshing aspect buckets on main process.
2024-08-11 14:32:39,649 [INFO] (BaseMetadataBackend) Discovering new files...
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) Compressed 135 existing files from 15.
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) No new files discovered. Doing nothing.
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 135, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key crop_aspect not found in the current backend config, using the existing value 'square'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key crop_style not found in the current backend config, using the existing value 'random'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key config_version not found in the current backend config, using the existing value '2'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key hash_filenames not found in the current backend config, using the existing value 'True'.
2024-08-11 14:32:39,652 [INFO] (DataBackendFactory) Configured backend: {'id': 'test-flux-v1', 'config': {'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f3f0ee104f0>, 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f3f0ee10a30>}
(Rank: 0) | Bucket | Image Count (per-GPU)
------------------------------
(Rank: 0) | 0.78 | 10
(Rank: 0) | 1.0 | 6
(Rank: 0) | 1.13 | 6
(Rank: 0) | 1.29 | 9
(Rank: 0) | 1.46 | 8
(Rank: 0) | 0.68 | 15
(Rank: 0) | 0.88 | 12
(Rank: 0) | 0.6 | 10
(Rank: 0) | 0.65 | 33
(Rank: 0) | 0.55 | 2
(Rank: 0) | 1.36 | 1
(Rank: 0) | 0.57 | 10
(Rank: 0) | 1.75 | 2
(Rank: 0) | 0.74 | 7
(Rank: 0) | 1.54 | 4
2024-08-11 14:32:39,653 [INFO] (DataBackendFactory) (id=test-flux-v1) Collecting captions.
2024-08-11 14:32:39,653 [INFO] (DataBackendFactory) (id=test-flux-v1) Initialise text embed pre-computation using the filename caption strategy. We have 135 captions to process.
2024-08-11 14:32:39,667 [INFO] (DataBackendFactory) (id=test-flux-v1) Completed processing 135 captions.
2024-08-11 14:32:39,667 [INFO] (DataBackendFactory) (id=test-flux-v1) Creating VAE latent cache.
2024-08-11 14:32:39,668 [INFO] (DataBackendFactory) (id=test-flux-v1) Discovering cache objects..
2024-08-11 14:32:39,673 [INFO] (DataBackendFactory) Configured backend: {'id': 'test-flux-v1', 'config': {'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f3f0ee104f0>, 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f3f0ee10a30>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7f3f0ee2d420>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x7f3f0ee2d840>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7f3f0ee2d7e0>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x7f3f142525c0>, 'vaecache': <helpers.caching.vae.VAECache object at 0x7f3f0ee10190>}
2024-08-11 14:32:39,951 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-08-11 14:32:40,111 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-08-11 14:32:41,442 [INFO] (__main__) After nuking text encoders from orbit, we freed 9.11 GB of VRAM. The real memories were the friends we trained a model on along the way.
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1874.13it/s]
2024-08-11 14:32:42,282 [INFO] (__main__) Keeping some base model weights in torch.bfloat16.
2024-08-11 14:32:42,283 [INFO] (helpers.training.quantisation) Loading Quanto for LoRA training. This may take a few minutes.
2024-08-11 14:32:42,283 [INFO] (helpers.training.quantisation) Quantising FluxTransformer2DModel. Using int4-quanto.
2024-08-11 14:33:17,501 [INFO] (helpers.training.quantisation) Freezing model.
2024-08-11 14:33:24,342 [INFO] (__main__) Using LoRA training mode (rank=16)
2024-08-11 14:33:24,509 [INFO] (__main__) Collected the following data backends: ['text-embeds', 'test-flux-v1']
2024-08-11 14:33:24,510 [INFO] (__main__) Loading cosine learning rate scheduler with 250 warmup steps
2024-08-11 14:33:24,514 [INFO] (__main__) Learning rate: 0.0001
2024-08-11 14:33:24,514 [INFO] (__main__) Using bf16 AdamW optimizer with stochastic rounding.
2024-08-11 14:33:24,518 [INFO] (__main__) Optimizer arguments, weight_decay=0.01 eps=1e-08, extra_arguments={'weight_decay': 0.01, 'eps': 1e-08, 'betas': (0.9, 0.999), 'lr': 0.0001}
2024-08-11 14:33:24,518 [INFO] (__main__) Loading cosine learning rate scheduler with 250 warmup steps
2024-08-11 14:33:24,518 [INFO] (__main__) Using Cosine learning rate scheduler.
2024-08-11 14:33:24,519 [INFO] (SaveHookManager) Denoiser class set to: ControlNetModel.
2024-08-11 14:33:24,520 [INFO] (SaveHookManager) Pipeline class set to: FluxPipeline.
2024-08-11 14:33:24,520 [INFO] (__main__) Loading our accelerator...
2024-08-11 14:38:25,374 [INFO] (__main__) After removing any undesired samples and updating cache entries, we have settled on 13 epochs and 810 steps per epoch.
2024-08-11 14:38:25,453 [INFO] (__main__) After nuking the VAE from orbit, we freed 163.84 MB of VRAM.
2024-08-11 14:38:25,453 [INFO] (__main__) Checkpoint 'latest' does not exist. Starting a new training run.
2024-08-11 14:38:25,453 [INFO] (MultiAspectSampler-test-flux-v1)
(Rank: 0) -> Number of seen images: 0
(Rank: 0) -> Number of unseen images: 135
(Rank: 0) -> Current Bucket: None
(Rank: 0) -> 15 Buckets: ['0.78', '1.0', '1.13', '1.29', '1.46', '0.68', '0.88', '0.6', '0.65', '0.55', '1.36', '0.57', '1.75', '0.74', '1.54']
(Rank: 0) -> 0 Exhausted Buckets: []
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 1c6e9f5d0a315071c3f722c5dabbb83b.
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
2024-08-11 14:38:32,133 [INFO] (__main__) Moving the diffusion transformer to GPU in int4-quanto precision.
2024-08-11 14:38:32,180 [INFO] (__main__)
***** Running training *****
- Num batches = 810
- Num Epochs = 13
- Current Epoch = 1
- Total train batch size (w. parallel, distributed & accumulation) = 1
- Instantaneous batch size per device = 1
- Gradient Accumulation steps = 1
- Total optimization steps = 10000
- Total optimization steps remaining = 10000
Epoch 1/13, Steps: 0%| | 0/10000 [00:00<?, ?it/s]Expected A.dtype() == at::kBFloat16 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Traceback (most recent call last):
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/train.py", line 2751, in <module>
main()
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/train.py", line 2077, in main
model_pred = transformer(
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
return model_forward(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 396, in forward
encoder_hidden_states, hidden_states = torch.utils.checkpoint.checkpoint(
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_compile.py", line 31, in inner
return disable_fn(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
return fn(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
ret = function(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 391, in custom_forward
return module(*inputs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 200, in forward
attn_output, context_attn_output = self.attn(
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 490, in forward
return self.processor(
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1800, in __call__
query = attn.to_q(hidden_states)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/peft/tuners/lora/quanto.py", line 64, in forward
result = self.base_layer(x)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/nn/qlinear.py", line 45, in forward
return torch.nn.functional.linear(input, self.qweight, bias=self.bias)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor.py", line 90, in __torch_function__
return qfunc(*args, **kwargs)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor_func.py", line 152, in linear
return QTensorLinear.apply(input, other, bias)
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor_func.py", line 118, in forward
output = torch._weight_int4pack_mm(
RuntimeError: Expected A.dtype() == at::kBFloat16 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Epoch 1/13, Steps: 0%| | 0/10000 [00:00<?, ?it/s]
Config
# Configure these values.
# 'lora' or 'full'
# lora - train a small network for a character or style, or both. quite versatile.
# full - requires lots of vram, trains very slowly, needs a lot of data and concepts.
export MODEL_TYPE='lora'
# SDXL is trained by default, but you will need to enable one of these options for anything else.
# Set this to 'true' if you are training a Stable Diffusion 3 checkpoint.
# Use MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=false
# Similarly, this is to train PixArt Sigma (1K or 2K) models.
# Use MODEL_NAME="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
export PIXART_SIGMA=false
# For old Stable Diffusion 1.x/2.x models, you'll enable this.
# Use MODEL_NAME="stabilityai/stable-diffusion-2-1"
export STABLE_DIFFUSION_LEGACY=false
# For Kwai-Kolors, enable KOLORS.
# Use MODEL_NAME="kwai-kolors/kolors-diffusers"
export KOLORS=false
# For Flux, if you have 8 GPUs and DeepSpeed configured.
# Use MODEL_NAME="black-forest-labs/FLUX.1-dev"
export FLUX=true
# ControlNet model training is only supported when MODEL_TYPE='full'
# See this document for more information: https://github.com/bghira/SimpleTuner/blob/main/documentation/CONTROLNET.md
# DeepFloyd, PixArt, and SD3 do not currently support ControlNet model training.
export CONTROLNET=false
# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false
# BitFit freeze strategy for the u-net causes everything but the biases to be frozen.
# This may help retain the full model's underlying capabilities. LoRA is currently not tested/known to work.
#if [[ "$MODEL_TYPE" == "full" ]]; then
# # When training a full model, we will rely on BitFit to keep the u-net intact.
# export USE_BITFIT=true
#elif [[ "$MODEL_TYPE" == "lora" ]]; then
# # LoRA can not use BitFit.
# export USE_BITFIT=false
#elif [[ "$MODEL_TYPE" == "deepfloyd-full" ]]; then
# export USE_BITFIT=true
#fi
# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"
# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=500
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=5
# This is decided as a relatively conservative 'constant' learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=1e-4 #@param {type:"number"}
# Using a Huggingface Hub model:
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"
# Make DEBUG_EXTRA_ARGS empty to disable wandb.
#export DEBUG_EXTRA_ARGS="--report_to=wandb"
#export TRACKER_PROJECT_NAME="${MODEL_TYPE}-training"
#export TRACKER_RUN_NAME="simpletuner-sdxl"
# Max number of steps OR epochs can be used. Not both.
export MAX_NUM_STEPS=10000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=0
# A convenient prefix for all of your training paths.
# These may be absolute or relative paths. Here, we are using relative paths.
# The output will just be in a folder called "output/models" by default.
export DATALOADER_CONFIG="config/multidatabackend.json"
export OUTPUT_DIR="output/models"
# Set this to "true" to push your model to Hugging Face Hub.
export PUSH_TO_HUB="false"
# If PUSH_TO_HUB and PUSH_CHECKPOINTS are both enabled, every saved checkpoint will be pushed to Hugging Face Hub.
export PUSH_CHECKPOINTS="true"
# This will be the model name for your final hub upload, eg. "yourusername/yourmodelname"
# It defaults to the wandb project name, but you can override this here.
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME
# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
export RESOLUTION=1024
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
# set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION_TYPE="pixel"
#export RESOLUTION=1 # 1.0 Megapixel training sizes
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION
# How many decimals to round aspect buckets to.
#export ASPECT_BUCKET_ROUNDING=2
# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=7.5
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100000
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION
# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=1
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=1
# How many images to encode at once with the VAE. Can increase VRAM use.
export VAE_BATCH_SIZE=1
# Use any standard scheduler type. constant, polynomial, constant_with_warmup
export LR_SCHEDULE="cosine"
# A warmup period allows the model and the EMA weights more importantly to familiarise itself with the current quanta.
# For the cosine or sine type schedules, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a Cosine period to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=250
# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# You may wish to disable dropout if you want to limit your changes strictly to the prompts you show the model.
# You may wish to increase the rate of dropout if you want to more broadly adopt your changes across the model.
export CAPTION_DROPOUT_PROBABILITY=0
export METADATA_UPDATE_INTERVAL=500
# How many workers to use for VAE caching.
export MAX_WORKERS=4
# Read and write batch sizes for VAE caching.
export READ_BATCH_SIZE=25
export WRITE_BATCH_SIZE=64
# How many images to process at once (resize, crop, transform) during VAE caching.
export IMAGE_PROCESSING_BATCH_SIZE=32
# When using large batch sizes, you'll need to increase the pool connection limit.
export AWS_MAX_POOL_CONNECTIONS=128
# For very large systems, setting this can reduce CPU overhead of torch spawning an unnecessarily large number of threads.
export TORCH_NUM_THREADS=8
# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0
# Bytedance recommends these be set to "trailing" so that inference and training behave in a more congruent manner.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"
# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5
# Set this to an explicit value of "false" to disable Xformers. Probably required for AMD users.
export USE_XFORMERS=false
# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This will save a lot of memory consumption when enabled.
export USE_GRADIENT_CHECKPOINTING=true
##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##
# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=false
# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# NOTE: When training a quantised base model, you can't use adamw_bf16. Instead, try adafactor or adamw.
# Choices: adamw, adamw8bit, adafactor, dadaptation, adamw_bf16
export OPTIMIZER="adamw_bf16"
# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
# NOTE: EMA is not currently applied to LoRA.
export USE_EMA=false
export EMA_DECAY=0.999
export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --base_model_precision=int4-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change --text_encoder_lr=1e-5" # quant
#export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --text_encoder_lr=1e-5" # no-quant
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"
## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"
## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"
# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"
# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=42
# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16" # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true
# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=1
#export TRAINING_NUM_PROCESSES=2 #2 or more for --multi_gpu
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS="" # --multi_gpu or other similar flags for huggingface accelerate
# With Pytorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing however, each resolution change will recompile. Seriously, just don't do it.
# Well, then again... Pytorch 2.2 has support for dynamic shapes. Why not?
export TRAINING_DYNAMO_BACKEND='no' # or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD)
export TOKENIZERS_PARALLELISM=false
You need the base model in bf16 for int4 training to work, which means using adamw_bf16 and setting PURE_BF16=true.
I should update the documentation to reflect this.
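In the env file that boils down to something like this (a minimal sketch of just the two variables in question; everything else stays as it is):
# int4-quanto training expects the base model kept in bf16
export OPTIMIZER="adamw_bf16"
export PURE_BF16=true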
Hmm, it actually looks like you do have the required options set.
export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --base_model_precision=int4-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change --text_encoder_lr=1e-5 --base_model_default_dtype=bf16"
But does adding this parameter (--base_model_default_dtype=bf16) to the end of TRAINER_EXTRA_ARGS change anything?
File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
Hmm... I've seen this before... but where?
But does adding this parameter (--base_model_default_dtype=bf16) to the end of TRAINER_EXTRA_ARGS change anything?
No, still the same error.
Downgrade PyTorch to 2.3.1.
You... can't... it relies on PyTorch 2.4 to use Quanto.
Downgrade them too: PyTorch, xformers, and Quanto.
Just delete xformers at this point 😃
I hit the same problem, and downgrading fixed it for me. It may not be caused by the PyTorch version but by the version of xformers or Quanto.
Please try the latest main.
I'm getting this on the latest main with xformers, Quanto, and PyTorch 2.4 installed. I'm just going to try int8-quanto instead.
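That is, just swapping the precision flag in TRAINER_EXTRA_ARGS, assuming the rest of the line stays as in the config above:
export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --base_model_precision=int8-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change --text_encoder_lr=1e-5"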
Please try the latest main.
Still the same error for me
Downgrade PyTorch to 2.3.1.
@tanis2010 Can you give me a hint on the commands you used to downgrade everything to 2.3.1 so I can test that?
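Something like the following is my guess (the torch/torchvision pins below are the matching 2.3.1 pair; which xformers and optimum-quanto releases go with it is an assumption I'd still need to verify):
source .venv/bin/activate
pip uninstall -y torch torchvision xformers
pip install "torch==2.3.1" "torchvision==0.18.1" --index-url https://download.pytorch.org/whl/cu121
# then reinstall xformers / optimum-quanto using releases built against torch 2.3.1, or leave xformers out entirely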