
Training process sometimes dies with <Signals.SIGKILL: 9>

Shed-The-Skin opened this issue 6 months ago · 15 comments

After setting up the repo by following the FLUX quickstart guide, I ran a training session overnight on my RTX 4090, only to find that it had died somewhere along the way. It seems to have died after processing bucket 1.0. Console output:

(.venv) vwing@vwing-Desktop:~/Documents/SimpleTuner$ bash train.sh
2024-08-05 21:09:57,307 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-08-05 21:09:57,307 [INFO] (ArgsParser) VAE Model: black-forest-labs/FLUX.1-dev
2024-08-05 21:09:57,307 [INFO] (ArgsParser) Default VAE Cache location:
2024-08-05 21:09:57,307 [INFO] (ArgsParser) Text Cache location: cache
2024-08-05 21:09:57,307 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux.
2024-08-05 21:09:57,307 [WARNING] (ArgsParser) Gradient accumulation steps are enabled, but gradient precision is set to 'unmodified'. This may lead to numeric instability. Consider setting --gradient_precision=fp32.
2024-08-05 21:09:57,365 [INFO] (__main__) Enabling tf32 precision boost for NVIDIA devices due to --allow_tf32.
2024-08-05 21:09:57,365 [INFO] (__main__) Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-08-05 21:09:57,958 [INFO] (__main__) Loading OpenAI CLIP-L text encoder from black-forest-labs/FLUX.1-dev/text_encoder..
2024-08-05 21:09:58,215 [INFO] (__main__) Loading T5 XXL v1.1 text encoder from black-forest-labs/FLUX.1-dev/text_encoder_2..
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3514.29it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.59s/it]
2024-08-05 21:10:02,078 [INFO] (__main__) Load VAE: black-forest-labs/FLUX.1-dev
2024-08-05 21:10:02,315 [INFO] (__main__) Moving text encoder to GPU.
2024-08-05 21:10:02,517 [INFO] (__main__) Moving text encoder 2 to GPU.
2024-08-05 21:10:04,141 [INFO] (__main__) Initialising VAE in bf16 precision, you may specify a different value if preferred: bf16, fp32, default
2024-08-05 21:10:04,189 [INFO] (__main__) Loaded VAE into VRAM.
2024-08-05 21:10:04,217 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json
2024-08-05 21:10:04,218 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
model_index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 536/536 [00:00<00:00, 7.37MB/s]
Loading pipeline components...:   0%|                                                                                                                                                                 | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 790.51it/s]
2024-08-05 21:10:04,394 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-08-05 21:10:04,394 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-08-05 21:10:09,692 [INFO] (DataBackendFactory) Completed loading text embed services.
2024-08-05 21:10:09,692 [INFO] (DataBackendFactory) Configuring data backend: pseudo-camera-10k-flux
2024-08-05 21:10:09,692 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Loading bucket manager.
2024-08-05 21:10:09,693 [INFO] (JsonMetadataBackend) Checking for cache file: datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json
2024-08-05 21:10:09,693 [WARNING] (JsonMetadataBackend) No cache file found, creating new one.
2024-08-05 21:10:09,694 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Refreshing aspect buckets on main process.
2024-08-05 21:10:09,694 [INFO] (BaseMetadataBackend) Discovering new files...
2024-08-05 21:10:13,224 [INFO] (BaseMetadataBackend) Compressed 0 existing files from 0.
2024-08-05 21:11:59,472 [INFO] (BaseMetadataBackend) Image processing statistics: {'total_processed': 14102, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-08-05 21:11:59,513 [INFO] (BaseMetadataBackend) Enforcing minimum image size of 0.5. This could take a while for very-large datasets.
2024-08-05 21:11:59,553 [INFO] (BaseMetadataBackend) Completed aspect bucket update.
2024-08-05 21:11:59,569 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7fcb8a4a7df0>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7fcb8a4a7cd0>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 1.0        | 14080
2024-08-05 21:11:59,570 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Collecting captions.
2024-08-05 21:11:59,622 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Initialise text embed pre-computation using the filename caption strategy. We have 14102 captions to process.
2024-08-05 21:25:46,726 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Completed processing 14102 captions.
2024-08-05 21:25:46,727 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Creating VAE latent cache.
2024-08-05 21:25:46,834 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Discovering cache objects..
Processing bucket 1.0:   4%|██                                                         | 504/14080 [12:10<5:56:13,  1.57s/it]
Processing bucket 1.0:   5%|██▉                                                        | 710/14080 [17:22<5:41:32,  1.53s/it]
(id=pseudo-camera-10k-flux) Bucket 1.0 caching results: {'not_local': 0, 'already_cached': 4056, 'cached': 200, 'total': 14080}
2024-08-06 01:46:57,256 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 1.0, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 1.0, 'target_downsample_size': 1.0, 'config_version': 2, 'hash_filenames': True}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7fcb8a4a7df0>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7fcb8a4a7cd0>, 'train_dataset': <helpers.multiaspect.dataset.MultiAspectDataset object at 0x7fcb8a4a5e10>, 'sampler': <helpers.multiaspect.sampler.MultiAspectSampler object at 0x7fcb8a4a56f0>, 'train_dataloader': <torch.utils.data.dataloader.DataLoader object at 0x7fcb8a4a68f0>, 'text_embed_cache': <helpers.caching.text_embeds.TextEmbeddingCache object at 0x7fcb8ab1e2c0>, 'vaecache': <helpers.caching.vae.VAECache object at 0x7fcb8981fd30>}
2024-08-06 01:46:57,451 [INFO] (validation) Precomputing the negative prompt embed for validations.
2024-08-06 01:46:57,608 [INFO] (__main__) Unloading text encoders, as they are not being trained.
2024-08-06 01:46:58,565 [INFO] (__main__) After nuking text encoders from orbit, we freed 9.11 GB of VRAM. The real memories were the friends we trained a model on along the way.
Fetching 3 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 11135.32it/s]
Traceback (most recent call last):
  File "/home/vwing/Documents/SimpleTuner/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/vwing/Documents/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/vwing/Documents/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/home/vwing/Documents/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/vwing/Documents/SimpleTuner/.venv/bin/python', 'train.py', '--model_type=lora', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--enable_xformers_memory_efficient_attention', '--gradient_checkpointing', '--set_grads_to_none', '--gradient_accumulation_steps=4', '--resume_from_checkpoint=latest', '--snr_gamma=5', '--data_backend_config=config/multidatabackend.json', '--num_train_epochs=0', '--max_train_steps=30000', '--metadata_update_interval=65', '--adam_bfloat16', '--learning_rate=8e-7', '--lr_scheduler=sine', '--seed', '42', '--lr_warmup_steps=1000', '--output_dir=output/models', '--inference_scheduler_timestep_spacing=trailing', '--training_scheduler_timestep_spacing=trailing', '--report_to=wandb', '--allow_tf32', '--mixed_precision=bf16', '--base_model_precision=int8-quanto', '--lora_rank=4', '--flux', '--train_batch=10', '--max_workers=32', '--read_batch_size=25', '--write_batch_size=64', '--caption_dropout_probability=0.1', '--torch_num_threads=8', '--image_processing_batch_size=32', '--vae_batch_size=12', '--validation_prompt=ethnographic photography of teddy bear at a picnic', '--num_validation_images=1', '--validation_num_inference_steps=30', '--validation_seed=42', '--minimum_image_size=1024', '--resolution=1024', '--validation_resolution=1024x1024', '--resolution_type=pixel', '--checkpointing_steps=150', '--checkpoints_total_limit=2', '--validation_steps=100', '--tracker_run_name=simpletuner-sdxl', '--tracker_project_name=sdxl-training', '--validation_guidance=3', '--validation_guidance_rescale=0.0', '--validation_negative_prompt=blurry, cropped, ugly']' died with <Signals.SIGKILL: 9>.

Does this error mean that the training ran out of VRAM or system RAM along the way? I'm not sure what to make of it; my best guess is the kernel OOM killer (I've sketched a quick check at the end of this post). I'm running under WSL, and I believe I'm on the bugfix/quanto-lora-loading branch. Output of pip freeze:

absl-py==2.1.0
accelerate==0.31.0
aiohttp==3.9.3
aiosignal==1.3.1
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.42.0
boto3==1.34.79
botocore==1.34.79
build==1.2.1
CacheControl==0.14.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
cleo==2.1.0
click==8.1.7
clip-interrogator==0.6.0
colorama==0.4.6
compel==2.0.2
crashtest==0.4.1
cryptography==43.0.0
dadaptation==3.2
datasets==2.14.4
deepspeed==0.10.3
diffusers @ git+https://github.com/huggingface/diffusers@15924bc73bfd74c769f23c8d2636d6c7514163a0
dill==0.3.7
distlib==0.3.8
docker-pycreds==0.4.0
dulwich==0.21.7
fastjsonschema==2.20.0
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
ftfy==6.2.0
gitdb==4.0.11
GitPython==3.1.43
grpcio==1.62.1
hjson==3.1.0
huggingface-hub==0.23.2
idna==3.6
importlib_metadata==7.1.0
installer==0.7.0
iterutils==0.1.6
jaraco.classes==3.4.0
jeepney==0.8.0
Jinja2==3.1.3
jmespath==1.0.1
keyring==24.3.1
lightning-utilities==0.11.2
Markdown==3.6
MarkupSafe==2.1.5
more-itertools==10.3.0
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
open-clip-torch==2.24.0
opencv-python==4.9.0.80
optimum-quanto==0.2.2
packaging==24.0
pandas==2.2.1
peft==0.9.0
pexpect==4.9.0
pillow==10.3.0
pkginfo==1.11.1
platformdirs==4.2.2
poetry==1.8.3
poetry-core==1.9.0
poetry-plugin-export==1.8.0
prodigyopt==1.0
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
py-cpuinfo==9.0.0
pyarrow==15.0.2
pycparser==2.22
pydantic==1.10.15
pyparsing==3.1.2
pyproject_hooks==1.1.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
rapidfuzz==3.9.5
regex==2023.12.25
requests==2.31.0
requests-toolbelt==1.0.0
s3transfer==0.10.1
safetensors==0.4.2
scipy==1.13.0
SecretStorage==3.3.3
sentencepiece==0.2.0
sentry-sdk==1.44.1
setproctitle==1.3.3
shellingham==1.5.4
six==1.16.0
smmap==5.0.1
sympy==1.12
tensorboard==2.16.2
tensorboard-data-server==0.7.2
timm==0.9.16
tokenizers==0.19.1
tomli==2.0.1
tomlkit==0.13.0
torch==2.3.0+cu121
torchaudio==2.3.0+cu121
torchmetrics==1.3.2
torchsde==0.2.6
torchvision==0.18.0+cu121
tqdm==4.66.2
trampoline==0.1.2
transformers==4.42.4
triton==2.3.0
triton-library==1.0.0rc3
trove-classifiers==2024.7.2
typing_extensions==4.11.0
tzdata==2024.1
urllib3==1.26.18
virtualenv==20.26.3
wandb==0.16.6
wcwidth==0.2.13
Werkzeug==3.0.2
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.4
zipp==3.18.1

config.env:

# Configure these values.

# 'lora' or 'full'
# lora - train a small network for a character or style, or both. Quite versatile.
# full - requires lots of VRAM, trains very slowly, and needs a lot of data and concepts.
export MODEL_TYPE='lora'

# Set this to 'true' if you are training a Stable Diffusion 3 checkpoint.
# Use MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export STABLE_DIFFUSION_3=false
# Similarly, this is to train PixArt Sigma (1K or 2K) models.
# Use MODEL_NAME="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
export PIXART_SIGMA=false
# For old Stable Diffusion 1.x/2.x models, you'll enable this.
# Use MODEL_NAME="stabilityai/stable-diffusion-2-1"
export STABLE_DIFFUSION_LEGACY=false
# For Kwai-Kolors, enable KOLORS.
# Use MODEL_NAME="kwai-kolors/kolors-diffusers"
export KOLORS=false
# For Flux, if you have 8 GPUs and DeepSpeed configured.
# Use MODEL_NAME="black-forest-labs/FLUX.1-dev"
export FLUX=true

# ControlNet model training is only supported when MODEL_TYPE='full'
# See this document for more information: https://github.com/bghira/SimpleTuner/blob/main/documentation/CONTROLNET.md
# DeepFloyd, PixArt, and SD3 do not currently support ControlNet model training.
export CONTROLNET=false

# DoRA enhances the training style of LoRA, but it will run more slowly at the same rank.
# See: https://arxiv.org/abs/2402.09353
# See: https://github.com/huggingface/peft/pull/1474
export USE_DORA=false

# The BitFit freeze strategy for the u-net freezes everything except the biases.
# This may help retain the full model's underlying capabilities. It is currently untested with LoRA and not known to work there.
#if [[ "$MODEL_TYPE" == "full" ]]; then
#    # When training a full model, we will rely on BitFit to keep the u-net intact.
#    export USE_BITFIT=true
#elif [[ "$MODEL_TYPE" == "lora" ]]; then
#    # LoRA can not use BitFit.
#    export USE_BITFIT=false
#elif [[ "$MODEL_TYPE" == "deepfloyd-full" ]]; then
#    export USE_BITFIT=true
#fi

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.
# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=150
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=2

# This is chosen as a relatively conservative learning rate.
# Adjust higher or lower depending on how burnt your model becomes.
export LEARNING_RATE=8e-7 #@param {type:"number"}

# Using a Huggingface Hub model:
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"

# Make DEBUG_EXTRA_ARGS empty to disable wandb.
export DEBUG_EXTRA_ARGS="--report_to=wandb"
export TRACKER_PROJECT_NAME="sdxl-training"
export TRACKER_RUN_NAME="simpletuner-sdxl"

# Max number of steps OR epochs can be used. Not both.
export MAX_NUM_STEPS=30000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=0

# A convenient prefix for all of your training paths.
# These may be absolute or relative paths. Here, we are using relative paths.
# The output will just be in a folder called "output/models" by default.
export DATALOADER_CONFIG="config/multidatabackend.json"
export OUTPUT_DIR="output/models"

# Set this to "true" to push your model to Hugging Face Hub.
export PUSH_TO_HUB="false"
# If PUSH_TO_HUB and PUSH_CHECKPOINTS are both enabled, every saved checkpoint will be pushed to Hugging Face Hub.
export PUSH_CHECKPOINTS="true"
# This will be the model name for your final hub upload, e.g. "yourusername/yourmodelname".
# It defaults to the wandb project name, but you can override this here.
export HUB_MODEL_NAME=$TRACKER_PROJECT_NAME

# By default, images will be resized so their SMALLER EDGE is 1024 pixels, maintaining aspect ratio.
# Setting this value to 768px might result in more reasonable training data sizes for SDXL.
export RESOLUTION=1024
# If you want to have the training data resized by pixel area (Megapixels) rather than edge length,
#  set this value to "area" instead of "pixel", and uncomment the next RESOLUTION declaration.
export RESOLUTION_TYPE="pixel"
#export RESOLUTION=1          # 1.0 Megapixel training sizes
# If RESOLUTION_TYPE="pixel", the minimum resolution specifies the smaller edge length, measured in pixels. Recommended: 1024.
# If RESOLUTION_TYPE="area", the minimum resolution specifies the total image area, measured in megapixels. Recommended: 1.
export MINIMUM_RESOLUTION=$RESOLUTION
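# As a worked example of the "pixel" mode above (my understanding of it): with
# RESOLUTION=1024, a 3000x2000 source image is scaled so its smaller edge becomes
# 1024, i.e. it trains at 1536x1024.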

# How many decimals to round aspect buckets to.
#export ASPECT_BUCKET_ROUNDING=2

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# If you also supply a user prompt library or `--use_prompt_library`, this will be added to those lists.
export VALIDATION_PROMPT="ethnographic photography of teddy bear at a picnic"
export VALIDATION_GUIDANCE=3
# You'll want to set this to 0.7 if you are training a terminal SNR model.
export VALIDATION_GUIDANCE_RESCALE=0.0
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100
export VALIDATION_NUM_INFERENCE_STEPS=30
export VALIDATION_NEGATIVE_PROMPT="blurry, cropped, ugly"
export VALIDATION_SEED=42
export VALIDATION_RESOLUTION=1024x1024


# Adjust this for your GPU memory size. This, and resolution, are the biggest VRAM killers.
export TRAIN_BATCH_SIZE=10
# Accumulate your update gradient over many steps, to save VRAM while still having higher effective batch size:
# effective batch size = ($TRAIN_BATCH_SIZE * $GRADIENT_ACCUMULATION_STEPS).
export GRADIENT_ACCUMULATION_STEPS=4
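# With the values above, the effective batch size works out to 10 * 4 = 40.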

# Use any standard scheduler type: constant, polynomial, constant_with_warmup, etc.
export LR_SCHEDULE="sine"
# A warmup period allows the model, and more importantly the EMA weights, to familiarise themselves with the current data.
# For the cosine or sine schedule types, the warmup period defines the interval between peaks or valleys.
# Use a sine schedule to simulate a warmup period, or a cosine schedule to simulate a polynomial start.
#export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
export LR_WARMUP_STEPS=1000

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
# You may wish to disable dropout if you want to limit your changes strictly to the prompts you show the model.
# You may wish to increase the rate of dropout if you want to more broadly adopt your changes across the model.
export CAPTION_DROPOUT_PROBABILITY=0.1

export METADATA_UPDATE_INTERVAL=65

# How many workers to use for VAE caching.
export MAX_WORKERS=32
# Read and write batch sizes for VAE caching.
export READ_BATCH_SIZE=25
export WRITE_BATCH_SIZE=64
# How many images to encode at once with the VAE. Can increase VRAM use.
export VAE_BATCH_SIZE=12
# How many images to process at once (resize, crop, transform) during VAE caching.
export IMAGE_PROCESSING_BATCH_SIZE=32
# When using large batch sizes, you'll need to increase the pool connection limit.
export AWS_MAX_POOL_CONNECTIONS=128
# For very large systems, setting this can reduce CPU overhead of torch spawning an unnecessarily large number of threads.
export TORCH_NUM_THREADS=8

# If this is set, any images that fail to open will be DELETED to avoid re-checking them every time.
export DELETE_ERRORED_IMAGES=0
# If this is set, any images that are too small for the minimum resolution size will be DELETED.
export DELETE_SMALL_IMAGES=0

# ByteDance recommends these be set to "trailing" so that inference and training behave more congruently.
# To follow the original SDXL training strategy, use "leading" instead, though results are generally worse.
export TRAINING_SCHEDULER_TIMESTEP_SPACING="trailing"
export INFERENCE_SCHEDULER_TIMESTEP_SPACING="trailing"

# Removing this option or unsetting it uses vanilla training. Setting it reweights the loss by the position of the timestep in the noise schedule.
# A value "5" is recommended by the researchers. A value of "20" is the least impact, and "1" is the most impact.
export MIN_SNR_GAMMA=5
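# If I understand the min-SNR paper correctly, the per-timestep loss weight is
# roughly min(SNR(t), gamma) / SNR(t), so gamma=5 caps the contribution of
# low-noise (high-SNR) timesteps while leaving the noisier ones untouched.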

# Set this to an explicit value of "false" to disable Xformers. Probably required for AMD users.
export USE_XFORMERS=true

# There's basically no reason to unset this. However, to disable it, use an explicit value of "false".
# This will save a lot of memory consumption when enabled.
export USE_GRADIENT_CHECKPOINTING=true

##
# Options below here may require a bit more complicated configuration, so they are not simple variables.
##

# TF32 is great on Ampere or Ada, not sure about earlier generations.
export ALLOW_TF32=true
# AdamW 8Bit is a robust and lightweight choice. Adafactor might reduce memory consumption, and Dadaptation is slow and experimental.
# AdamW is the default optimizer, but it uses a lot of memory and is slower than AdamW8Bit or Adafactor.
# When training a quantised base model, you can't use adamw_bf16. Instead, try adafactor or adamw.
# Choices: adamw, adamw8bit, adafactor, dadaptation, adamw_bf16
export OPTIMIZER="adamw_bf16"


# EMA is a strong regularisation method that uses a lot of extra VRAM to hold two copies of the weights.
# This is worthwhile on large training runs, but not so much for smaller training runs.
export USE_EMA=false
export EMA_DECAY=0.999

export TRAINER_EXTRA_ARGS="--base_model_precision=int8-quanto --lora_rank=4"
## For offset noise training:
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --text_encoder_1_precision=no_change --text_encoder_2_precision=no_change"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=trailing --inference_scheduler_timestep_spacing=trailing"
## You may benefit from directing training toward a specific weighted subset of timesteps.
# In this example, we train the final 25% of the timestep schedule with a 3x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=later --timestep_bias_portion=0.25 --timestep_bias_multiplier=3"
# In this example, we train the earliest 25% of the timestep schedule with a 5x bias.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=earlier --timestep_bias_portion=0.25 --timestep_bias_multiplier=5"
# Here, we designate that specifically, timesteps 200 to 500 should be prioritised.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --timestep_bias_strategy=range --timestep_bias_begin=200 --timestep_bias_end=500 --timestep_bias_multiplier=3"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
# Not recommended for terminal SNR models.
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

# For Wasabi S3 filesystem backend (experimental)
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --data_backend=aws --aws_bucket_name=test123"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_endpoint_url=https://s3.wasabisys.com"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_access_key=1234567890"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --aws_secret_access_key=0987654321"


# Reproducible training. Set to -1 to disable.
export TRAINING_SEED=42

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.
export PURE_BF16=true

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=1
export TRAINING_NUM_MACHINES=1
export ACCELERATE_EXTRA_ARGS=""                          # --multi_gpu or other similar flags for huggingface accelerate

# With PyTorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing, however, each resolution change will recompile. Seriously, just don't do it.
# Well, then again... PyTorch 2.2 has support for dynamic shapes. Why not?
export TRAINING_DYNAMO_BACKEND='no'                # set to 'no' to disable torch compile in case of performance issues or lack of support (e.g. AMD)

export TOKENIZERS_PARALLELISM=false

multidatabackend.json:

[
  {
    "id": "pseudo-camera-10k-flux",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1.0,
    "minimum_image_size": 0.5,
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "resolution_type": "area",
    "cache_dir_vae": "cache/vae/flux/pseudo-camera-10k",
    "instance_data_dir": "datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "json"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux/pseudo-camera-10k",
    "disabled": false,
    "write_batch_size": 128
  }
]
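
(My reading of the image backend above: resolution 1.0 with resolution_type "area" targets roughly 1.0 megapixel per image, so the square centre crops land at about 1024x1024, which matches the single 1.0 aspect bucket in the log.)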

I'd appreciate any and all help getting this running. Thank you.
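
For what it's worth, my working theory is that a bare <Signals.SIGKILL: 9> means the Linux OOM killer reaped the process over system RAM rather than VRAM (a CUDA out-of-memory would normally surface as a Python exception instead). Assuming WSL2 exposes the kernel log, something like this should confirm or rule that out:

# Check whether the kernel OOM killer terminated the trainer (run inside WSL):
sudo dmesg | grep -iE 'killed process|out of memory'

# Watch RAM/swap headroom while the caching phase runs:
watch -n 5 free -h

If it is the OOM killer, I suspect WSL2's default memory cap; as far as I know it can be raised from the Windows side in %UserProfile%\.wslconfig (the sizes below are placeholder values, not recommendations):

[wsl2]
memory=28GB
swap=16GB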

Shed-The-Skin · Aug 06 '24 14:08