
Flux LoRA 4-bit quantized training doesn't start

Open whythisusername opened this issue 1 year ago • 13 comments

Training crashes right before the first step starts. I'm attaching the full log and config. I haven't found an existing issue like this and would like to know how to fix it.

Log
/train.sh 
/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/nvidia/nvjitlink/lib
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-08-11 14:32:29,095 [INFO] (ArgsParser) VAE Model: black-forest-labs/FLUX.1-dev
2024-08-11 14:32:29,095 [INFO] (ArgsParser) Default VAE Cache location: 
2024-08-11 14:32:29,095 [INFO] (ArgsParser) Text Cache location: cache
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux.
2024-08-11 14:32:29,095 [WARNING] (ArgsParser) Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28.
2024-08-11 14:32:29,318 [WARNING] (__main__) If using an Ada or Ampere NVIDIA device, --allow_tf32 could add a bit more performance.
2024-08-11 14:32:29,318 [INFO] (__main__) Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-08-11 14:32:30,429 [INFO] (__main__) Loading OpenAI CLIP-L text encoder from black-forest-labs/FLUX.1-dev/text_encoder..
2024-08-11 14:32:30,769 [INFO] (__main__) Loading T5 XXL v1.1 text encoder from black-forest-labs/FLUX.1-dev/text_encoder_2..
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 25420.02it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.86it/s]
2024-08-11 14:32:33,475 [INFO] (__main__) Load VAE: black-forest-labs/FLUX.1-dev
2024-08-11 14:32:33,813 [INFO] (__main__) Moving text encoder to GPU.
2024-08-11 14:32:33,864 [INFO] (__main__) Moving text encoder 2 to GPU.
2024-08-11 14:32:34,309 [INFO] (__main__) Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-08-11 14:32:34,327 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json
2024-08-11 14:32:34,328 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
Loading pipeline components...:   0%|                                                                                  | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1848.20it/s]
2024-08-11 14:32:34,638 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-08-11 14:32:34,640 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-08-11 14:32:39,646 [WARNING] (DataBackendFactory) Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption-dropout_probability=0.1 as a recommended value.
2024-08-11 14:32:39,646 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-08-11 14:32:39,646 [INFO] (DataBackendFactory) Configuring data backend: test-flux-v1
2024-08-11 14:32:39,647 [INFO] (DataBackendFactory) (id=test-flux-v1) Loading bucket manager.
2024-08-11 14:32:39,648 [INFO] (JsonMetadataBackend) Checking for cache file: /disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1/aspect_ratio_bucket_indices.json
2024-08-11 14:32:39,649 [INFO] (JsonMetadataBackend) Pulling cache file from storage
2024-08-11 14:32:39,649 [INFO] (DataBackendFactory) (id=test-flux-v1) Refreshing aspect buckets on main process.
2024-08-11 14:32:39,649 [INFO] (BaseMetadataBackend) Discovering new files...
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) Compressed 135 existing files from 15.
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) No new files discovered. Doing nothing.
2024-08-11 14:32:39,650 [INFO] (BaseMetadataBackend) Statistics: {'total_processed': 0, 'skipped': {'already_exists': 135, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key crop_aspect not found in the current backend config, using the existing value 'square'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key crop_style not found in the current backend config, using the existing value 'random'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key disable_validation not found in the current backend config, using the existing value 'False'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key config_version not found in the current backend config, using the existing value '2'.
2024-08-11 14:32:39,652 [WARNING] (DataBackendFactory) Key hash_filenames not found in the current backend config, using the existing value 'True'.
2024-08-11 14:32:39,652 [INFO] (DataBackendFactory) Configured backend: {'id': 'test-flux-v1', 'config': {'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f3f0ee104f0>, 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f3f0ee10a30>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 0.78       | 10          
(Rank: 0)  | 1.0        | 6           
(Rank: 0)  | 1.13       | 6           
(Rank: 0)  | 1.29       | 9           
(Rank: 0)  | 1.46       | 8           
(Rank: 0)  | 0.68       | 15          
(Rank: 0)  | 0.88       | 12          
(Rank: 0)  | 0.6        | 10          
(Rank: 0)  | 0.65       | 33          
(Rank: 0)  | 0.55       | 2           
(Rank: 0)  | 1.36       | 1           
(Rank: 0)  | 0.57       | 10          
(Rank: 0)  | 1.75       | 2           
(Rank: 0)  | 0.74       | 7           
(Rank: 0)  | 1.54       | 4           
2024-08-11 14:32:39,653 [INFO] (DataBackendFactory) (id=test-flux-v1) Collecting captions.
2024-08-11 14:32:39,653 [INFO] (DataBackendFactory) (id=test-flux-v1) Initialise text embed pre-computation using the filename caption strategy. We have 135 captions to process.
2024-08-11 14:32:39,667 [INFO] (DataBackendFactory) (id=test-flux-v1) Completed processing 135 captions.
2024-08-11 14:32:39,667 [INFO] (DataBackendFactory) (id=test-flux-v1) Creating VAE latent cache.
2024-08-11 14:32:39,668 [INFO] (DataBackendFactory) (id=test-flux-v1) Discovering cache objects..
2024-08-11 14:32:39,673 [INFO] (DataBackendFactory) Configured backend: {'id': 'test-flux-v1', 'config': {'repeats': 5, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7f3f0ee104f0>, 'instance_data_dir': '/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/datasets/test-flux-v1', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7f3f0ee10a30>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7f3f0ee2d420>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x7f3f0ee2d840>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7f3f0ee2d7e0>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x7f3f142525c0>, 'vaecache': <helpers.caching.vae.VAECache object at 0x7f3f0ee10190>}
2024-08-11 14:32:39,951 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-08-11 14:32:40,111 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-08-11 14:32:41,442 [INFO] (__main__) After nuking text encoders from orbit, we freed 9.11 GB of VRAM. The real memories were the friends we trained a model on along the way.
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1874.13it/s]
2024-08-11 14:32:42,282 [INFO] (__main__) Keeping some base model weights in torch.bfloat16.
2024-08-11 14:32:42,283 [INFO] (helpers.training.quantisation) Loading Quanto for LoRA training. This may take a few minutes.
2024-08-11 14:32:42,283 [INFO] (helpers.training.quantisation) Quantising FluxTransformer2DModel. Using int4-quanto.
2024-08-11 14:33:17,501 [INFO] (helpers.training.quantisation) Freezing model.
2024-08-11 14:33:24,342 [INFO] (__main__) Using LoRA training mode (rank=16)
2024-08-11 14:33:24,509 [INFO] (__main__) Collected the following data backends: ['text-embeds', 'test-flux-v1']
2024-08-11 14:33:24,510 [INFO] (__main__) Loading cosine learning rate scheduler with 250 warmup steps
2024-08-11 14:33:24,514 [INFO] (__main__) Learning rate: 0.0001
2024-08-11 14:33:24,514 [INFO] (__main__) Using bf16 AdamW optimizer with stochastic rounding.
2024-08-11 14:33:24,518 [INFO] (__main__) Optimizer arguments, weight_decay=0.01 eps=1e-08, extra_arguments={'weight_decay': 0.01, 'eps': 1e-08, 'betas': (0.9, 0.999), 'lr': 0.0001}
2024-08-11 14:33:24,518 [INFO] (__main__) Loading cosine learning rate scheduler with 250 warmup steps
2024-08-11 14:33:24,518 [INFO] (__main__) Using Cosine learning rate scheduler.
2024-08-11 14:33:24,519 [INFO] (SaveHookManager) Denoiser class set to: ControlNetModel.
2024-08-11 14:33:24,520 [INFO] (SaveHookManager) Pipeline class set to: FluxPipeline.
2024-08-11 14:33:24,520 [INFO] (__main__) Loading our accelerator...
2024-08-11 14:38:25,374 [INFO] (__main__) After removing any undesired samples and updating cache entries, we have settled on 13 epochs and 810 steps per epoch.
2024-08-11 14:38:25,453 [INFO] (__main__) After nuking the VAE from orbit, we freed 163.84 MB of VRAM.
2024-08-11 14:38:25,453 [INFO] (__main__) Checkpoint 'latest' does not exist. Starting a new training run.
2024-08-11 14:38:25,453 [INFO] (MultiAspectSampler-test-flux-v1) 
(Rank: 0)     -> Number of seen images: 0
(Rank: 0)     -> Number of unseen images: 135
(Rank: 0)     -> Current Bucket: None
(Rank: 0)     -> 15 Buckets: ['0.78', '1.0', '1.13', '1.29', '1.46', '0.68', '0.88', '0.6', '0.65', '0.55', '1.36', '0.57', '1.75', '0.74', '1.54']
(Rank: 0)     -> 0 Exhausted Buckets: []
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 1c6e9f5d0a315071c3f722c5dabbb83b.
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
2024-08-11 14:38:32,133 [INFO] (__main__) Moving the diffusion transformer to GPU in int4-quanto precision.
2024-08-11 14:38:32,180 [INFO] (__main__) 
***** Running training *****
-  Num batches = 810
-  Num Epochs = 13
  - Current Epoch = 1
-  Total train batch size (w. parallel, distributed & accumulation) = 1
  - Instantaneous batch size per device = 1
  - Gradient Accumulation steps = 1
-  Total optimization steps = 10000
-  Total optimization steps remaining = 10000
Epoch 1/13, Steps:   0%|                                                                           | 0/10000 [00:00<?, ?it/s]Expected A.dtype() == at::kBFloat16 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Traceback (most recent call last):
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/train.py", line 2751, in <module>
    main()
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/train.py", line 2077, in main
    model_pred = transformer(
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 396, in forward
    encoder_hidden_states, hidden_states = torch.utils.checkpoint.checkpoint(
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_compile.py", line 31, in inner
    return disable_fn(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
    return fn(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
    ret = function(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 391, in custom_forward
    return module(*inputs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 200, in forward
    attn_output, context_attn_output = self.attn(
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1800, in __call__
    query = attn.to_q(hidden_states)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/peft/tuners/lora/quanto.py", line 64, in forward
    result = self.base_layer(x)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/nn/qlinear.py", line 45, in forward
    return torch.nn.functional.linear(input, self.qweight, bias=self.bias)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor.py", line 90, in __torch_function__
    return qfunc(*args, **kwargs)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor_func.py", line 152, in linear
    return QTensorLinear.apply(input, other, bias)
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/optimum/quanto/tensor/qtensor_func.py", line 118, in forward
    output = torch._weight_int4pack_mm(
RuntimeError: Expected A.dtype() == at::kBFloat16 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Epoch 1/13, Steps:   0%|                                                                           | 0/10000 [00:00<?, ?it/s]
Config
# Configure these values.

# 'lora' or 'full'
# lora - train a small network for a character or style, or both. quite versatile.
# full - requires lots of vram, trains very slowly, needs a lot of data and concepts.
export MODEL_TYPE='lora'

# SDXL is trained by default, but you will need to enable one of these options for anything else.

# Set this to 'true' if you are training a Stable Diffusion 3 checkpoint.
# Use MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=false
# Similarly, this is to train PixArt Sigma (1K or 2K) models.
# Use MODEL_NAME="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
export PIXART_SIGMA=false
# For old Stable Diffusion 1.x/2.x models, you'll enable this.
# Use MODEL_NAME="stabilityai/stable-diffusion-2-1"
export STABLE_DIFFUSION_LEGACY=false
# For Kwai-Kolors, enable KOLORS.
# Use MODEL_NAME="kwai-kolors/kolors-diffusers"
export KOLORS=false
# For Flux, if you have 8 GPUs and DeepSpeed configured.
# Use MODEL_NAME="black-forest-labs/FLUX.1-dev"
export FLUX=true

# ControlNet model training is only supported when MODEL_TYPE='full'
# See this document for more information: https://github.com/bghira/SimpleTuner/blob/main/documentation/CONTROLNET.md
# DeepFloyd, PixArt, and SD3 do not currently support ControlNet model training.
export CONTROLNET=false

# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false

# BitFit freeze strategy for the u-net causes everything but the biases to be frozen.
# This may help retain the full model's underlying capabilities. LoRA is currently not tested/known to work.
#if [[ "$MODEL_TYPE" == "full" ]]; then
#    # When training a full model, we will rely on BitFit to keep the u-net intact.
#    export USE_BITFIT=true
#elif [[ "$MODEL_TYPE" == "lora" ]]; then
#    # LoRA can not use BitFit.
#    export USE_BITFIT=false
#elif [[ "$MODEL_TYPE" == "deepfloyd-full" ]]; then
#    export USE_BITFIT=true
#fi

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=500
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=5

# This is decided as a relatively conservative 'constant' learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=1e-4 #@param {type:"number"}

# Using a Huggingface Hub model:
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"

# Make DEBUG_EXTRA_ARGS empty to disable wandb.
#export DEBUG_EXTRA_ARGS="--report_to=wandb"
#export TRACKER_PROJECT_NAME="${MODEL_TYPE}-training"
#export TRACKER_RUN_NAME="simpletuner-sdxl"

# Max number of steps OR epochs can be used. Not both.
export MAX_NUM_STEPS=10000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=0

# A convenient prefix for all of your training paths.
# These may be absolute or relative paths. Here, we are using relative paths.
# The output will just be in a folder called "output/models" by default.
export DATALOADER_CONFIG="config/multidatabackend.json"
export OUTPUT_DIR="output/models"

# Set this to "true" to push your model to Hugging Face Hub.
export PUSH_TO_HUB="false"
# If PUSH_TO_HUB and PUSH_CHECKPOINTS are both enabled, every saved checkpoint will be pushed to Hugging Face Hub.
export PUSH_CHECKPOINTS="true"
# This will be the model name for your final hub upload, eg. "yourusername/yourmodelname"
# It defaults to the wandb project name, but you can override this here.
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME

# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
export RESOLUTION=1024
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
#  set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION_TYPE="pixel"
#export RESOLUTION=1          # 1.0 Megapixel training sizes
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION

# How many decimals to round aspect buckets to.
#export ASPECT_BUCKET_ROUNDING=2

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=7.5
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100000
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=$RESOLUTION


# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=1
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=1
# How many images to encode at once with the VAE. Can increase VRAM use.
export VAE_BATCH_SIZE=1

# Use any standard scheduler type. constant, polynomial, constant_with_warmup
export LR_SCHEDULE="cosine"
# A warmup period allows the model and the EMA weights more importantly to familiarise itself with the current quanta.
# For the cosine or sine type schedules, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a Cosine period to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=250

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# You may wish to disable dropout if you want to limit your changes strictly to the prompts you show the model.
# You may wish to increase the rate of dropout if you want to more broadly adopt your changes across the model.
export CAPTION_DROPOUT_PROBABILITY=0

export METADATA_UPDATE_INTERVAL=500

# How many workers to use for VAE caching.
export MAX_WORKERS=4
# Read and write batch sizes for VAE caching.
export READ_BATCH_SIZE=25
export WRITE_BATCH_SIZE=64
# How many images to process at once (resize, crop, transform) during VAE caching.
export IMAGE_PROCESSING_BATCH_SIZE=32
# When using large batch sizes, you'll need to increase the pool connection limit.
export AWS_MAX_POOL_CONNECTIONS=128
# For very large systems, setting this can reduce CPU overhead of torch spawning an unnecessarily large number of threads.
export TORCH_NUM_THREADS=8

# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0

# Bytedance recommends these be set to "trailing" so that inference and training behave in a more congruent manner.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"

# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5

# Set this to an explicit value of "false" to disable Xformers. Probably required for AMD users.
export USE_XFORMERS=false

# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This will save a lot of memory consumption when enabled.
export USE_GRADIENT_CHECKPOINTING=true

##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##

# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=false

# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# NOTE: When training a quantised base model, you can't use adamw_bf16. Instead, try adafactor or adamw.
# Choices: adamw, adamw8bit, adafactor, dadaptation, adamw_bf16
export OPTIMIZER="adamw_bf16"


# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
# NOTE: EMA is not currently applied to LoRA.
export USE_EMA=false
export EMA_DECAY=0.999

export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --base_model_precision=int4-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change --text_encoder_lr=1e-5" # quant
#export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --text_encoder_lr=1e-5" # no-quant
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"


# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=42

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=1
#export TRAINING_NUM_PROCESSES=2 #2 or more for --multi_gpu
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS=""                          # --multi_gpu or other similar flags for huggingface accelerate

# With Pytorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing however, each resolution change will recompile. Seriously, just don't do it.
# Well, then again... Pytorch 2.2 has support for dynamic shapes. Why not?
export TRAINING_DYNAMO_BACKEND='no'                # or 'no' if you want to disable torch compile in case of performance issues or lack of support (eg. AMD)

export TOKENIZERS_PARALLELISM=false

whythisusername avatar Aug 11 '24 12:08 whythisusername

you need the base model in bf16 for int4 training to work, which means using adamw_bf16 and setting PURE_BF16=true

i should update documentation to reflect this

bghira avatar Aug 11 '24 13:08 bghira
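
The rule in the comment above (int4 needs `adamw_bf16` plus `PURE_BF16=true`) can be written as a small pre-flight check. This is only a sketch and not part of SimpleTuner: the function name `check_int4_settings` is hypothetical, and it encodes nothing beyond what the comment states, using the variable names from the posted config.

```python
import os


def check_int4_settings(env):
    """Return a list of problems for an int4-quanto run (hypothetical helper)."""
    problems = []
    extra_args = env.get("TRAINER_EXTRA_ARGS", "")
    # The rule only applies when the base model is quantised to int4.
    if "--base_model_precision=int4-quanto" not in extra_args:
        return problems
    if env.get("OPTIMIZER") != "adamw_bf16":
        problems.append("OPTIMIZER should be adamw_bf16")
    if str(env.get("PURE_BF16", "")).lower() != "true":
        problems.append("PURE_BF16 should be true")
    return problems


if __name__ == "__main__":
    # Reads the exported variables of the current shell.
    for problem in check_int4_settings(os.environ):
        print("config problem:", problem)
```

With the values from the config posted in this issue exported into the environment, the check returns an empty list, so the settings themselves satisfy the stated rule.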

hmm, it actually looks like you do have the required options set.

export TRAINER_EXTRA_ARGS="--lora_rank=16 --lora_alpha=16 --base_model_precision=int4-quanto --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change --text_encoder_lr=1e-5 --base_model_default_dtype=bf16"

but does adding this param to the end here change anything?

bghira avatar Aug 11 '24 13:08 bghira

  File "/disks/nv7000-encrypted/SD/LoRA-training/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))

hmm.. i've seen this before... but where..

bghira avatar Aug 11 '24 13:08 bghira

> but does adding this param to the end here change anything?

No, still the same error

whythisusername avatar Aug 11 '24 18:08 whythisusername

downgrade pytorch to 2.3.1

tanis2010 avatar Aug 14 '24 02:08 tanis2010

you... can't... it relies on pytorch 2.4 to use quanto

bghira avatar Aug 14 '24 05:08 bghira

downgrade them too: pytorch, xformers and quanto

tanis2010 avatar Aug 14 '24 05:08 tanis2010

just delete xformers at this point 😃

bghira avatar Aug 14 '24 05:08 bghira

I hit the same problem, and downgrading fixed it. It may not be caused by the pytorch version; it may be caused by the version of xformers or quanto.

tanis2010 avatar Aug 14 '24 05:08 tanis2010
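
Since the suspicion here is a version mismatch, it may help to post the exact installed versions alongside the error. A minimal sketch using only the standard library follows; the list of distribution names is an assumption based on the packages that appear in the traceback (for example, quanto is assumed to be installed as `optimum-quanto`), so adjust it to your environment.

```python
from importlib import metadata

# Distribution names are assumptions taken from the traceback paths.
PACKAGES = ("torch", "xformers", "optimum-quanto", "peft", "accelerate", "diffusers")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the same virtual environment that `train.sh` uses, otherwise the reported versions may not match what the trainer actually loads.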

please try latest main

bghira avatar Aug 15 '24 12:08 bghira

I'm getting this on latest main with xformers installed, quanto, and pytorch 2.4. I'm just going to try int8-quanto instead.

sjuxax avatar Aug 16 '24 05:08 sjuxax

> please try latest main

Still the same error for me

whythisusername avatar Aug 17 '24 00:08 whythisusername

> downgrade pytorch to 2.3.1

@tanis2010 Can you give me a hint on the commands you were using to downgrade everything to the 2.3.1 version, so I can test that?

whythisusername avatar Aug 17 '24 00:08 whythisusername