SimpleTuner: strange issues using a local model

I'm on a machine that does not often have internet access, and I see strange behavior when I try to run SimpleTuner with a local model. The hardware is a 3080.

At some point I had this error:

2024-08-17 10:38:36,223 [INFO] (__main__) Load tokenizers
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'CLIPTokenizer'. 
The class this function is called from is 'T5Tokenizer'.
2024-08-17 10:38:36,535 [WARNING] (helpers.training.text_encoding) Could not load secondary tokenizer (T5 XXL). Cannot continue: not a string
not a string

That error no longer appears, but I do not understand what I changed.

Now the "Pre-computing null embedding" step is extremely slow, although it does finish (about 19 minutes in the log below). More importantly, "Initialise text embed pre-computation" runs at more than 1000 s/it, so it would take more than 160 days to complete!

Here is the log:


/home/bidilun/github/SimpleTuner/.venv/lib/python3.11/site-packages/nvidia/nvjitlink/lib
DEBUG_EXTRA_ARGS not set, defaulting to empty.
2024-08-17 11:13:16,164 [WARNING] (ArgsParser) The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-08-17 11:13:16,164 [INFO] (ArgsParser) VAE Model: models/FLUX.1-dev
2024-08-17 11:13:16,164 [INFO] (ArgsParser) Default VAE Cache location: 
2024-08-17 11:13:16,164 [INFO] (ArgsParser) Text Cache location: cache
2024-08-17 11:13:16,164 [WARNING] (ArgsParser) Updating T5 XXL tokeniser max length to 512 for Flux.
2024-08-17 11:13:16,164 [WARNING] (ArgsParser) Gradient accumulation steps are enabled, but gradient precision is set to 'unmodified'. This may lead to numeric instability. Consider disabling gradient accumulation steps. Continuing in 10 seconds..
2024-08-17 11:13:26,184 [ERROR] (__main__) Failed to log into Hugging Face Hub: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.
2024-08-17 11:13:26,185 [INFO] (__main__) Load tokenizers
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
2024-08-17 11:13:26,412 [INFO] (helpers.training.text_encoding) Loading OpenAI CLIP-L text encoder from models/FLUX.1-dev/text_encoder..
2024-08-17 11:13:26,434 [INFO] (helpers.training.text_encoding) Loading T5 XXL v1.1 text encoder from models/FLUX.1-dev/text_encoder_2..
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  9.08it/s]
2024-08-17 11:13:28,910 [INFO] (__main__) Load VAE: models/FLUX.1-dev
2024-08-17 11:13:28,976 [INFO] (__main__) Moving text encoder to GPU.
2024-08-17 11:13:28,977 [INFO] (__main__) Moving text encoder 2 to GPU.
2024-08-17 11:13:28,980 [INFO] (__main__) Loading VAE onto accelerator, converting from torch.float32 to torch.bfloat16
2024-08-17 11:13:29,002 [INFO] (DataBackendFactory) Loading data backend config from config/multidatabackend.json
2024-08-17 11:13:29,002 [INFO] (DataBackendFactory) Configuring text embed backend: text-embeds
Loading pipeline components...:   0%|                                                                                                                                                                  | 0/5 [00:00<?, ?it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of models/FLUX.1-dev.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1317.97it/s]
2024-08-17 11:13:29,009 [INFO] (TextEmbeddingCache) (Rank: 0) (id=text-embeds) Listing all text embed cache entries
2024-08-17 11:13:29,009 [INFO] (DataBackendFactory) Pre-computing null embedding
2024-08-17 11:32:47,020 [INFO] (DataBackendFactory) Completed loading text embed services.                                   
2024-08-17 11:32:47,021 [INFO] (DataBackendFactory) Configuring data backend: pseudo-camera-10k-flux
2024-08-17 11:32:47,021 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Loading bucket manager.                      
2024-08-17 11:32:47,022 [INFO] (JsonMetadataBackend) Checking for cache file: datasets/pseudo-camera-10k/aspect_ratio_bucket_indices.json
2024-08-17 11:32:47,022 [WARNING] (JsonMetadataBackend) No cache file found, creating new one.
2024-08-17 11:32:47,022 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Refreshing aspect buckets on main process.
2024-08-17 11:32:47,022 [INFO] (BaseMetadataBackend) Discovering new files...
2024-08-17 11:33:03,844 [INFO] (BaseMetadataBackend) Compressed 0 existing files from 0.
2024-08-17 11:36:11,472 [INFO] (BaseMetadataBackend) Image processing statistics: {'total_processed': 12926, 'skipped': {'already_exists': 0, 'metadata_missing': 0, 'not_found': 0, 'too_small': 0, 'other': 0}}
2024-08-17 11:36:11,531 [INFO] (BaseMetadataBackend) Enforcing minimum image size of 512. This could take a while for very-large datasets.
2024-08-17 11:36:11,553 [INFO] (BaseMetadataBackend) Completed aspect bucket update.                                                                                                                                         
2024-08-17 11:36:11,568 [INFO] (DataBackendFactory) Configured backend: {'id': 'pseudo-camera-10k-flux', 'config': {'crop': True, 'crop_aspect': 'square', 'crop_aspect_buckets': None, 'crop_style': 'center', 'disable_validation': False, 'resolution': 512, 'resolution_type': 'pixel', 'caption_strategy': 'filename', 'instance_data_dir': 'datasets/pseudo-camera-10k', 'maximum_image_size': 512, 'target_downsample_size': 512, 'config_version': 2}, 'dataset_type': 'image', 'data_backend': <helpers.data_backend.local.LocalDataBackend object at 0x7ccf6beccc50>, 'instance_data_dir': 'datasets/pseudo-camera-10k', 'metadata_backend': <helpers.metadata.backends.json.JsonMetadataBackend object at 0x7ccf6bece1d0>}
(Rank: 0)  | Bucket     | Image Count (per-GPU)
------------------------------
(Rank: 0)  | 1.0        | 13926       
2024-08-17 11:36:11,569 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Collecting captions.
2024-08-17 11:36:11,611 [INFO] (DataBackendFactory) (id=pseudo-camera-10k-flux) Initialise text embed pre-computation using the filename caption strategy. We have 14102 captions to process.
Write embeds to disk:   0%|                                                         | 3/14102 [57:29<4499:44:06, 1148.95s/it]

Processing prompts:   0%|                                                           | 3/14102 [57:29<4499:29:35, 1148.89s/it]
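
As a sanity check, the ETA in the progress bar follows directly from the numbers in the log: remaining captions times seconds per iteration. A minimal sketch (the function name is only for illustration):

```python
def eta_hours(total_items: int, done_items: int, seconds_per_item: float) -> float:
    """Remaining time in hours, assuming a constant per-item rate."""
    remaining = total_items - done_items
    return remaining * seconds_per_item / 3600.0


# Values taken from the "Processing prompts" progress bar above.
hours = eta_hours(total_items=14102, done_items=3, seconds_per_item=1148.89)
days = hours / 24.0
print(f"{hours:.1f} hours, about {days:.0f} days")
```

At the measured rate this comes to roughly 187 days, which matches the 4499-hour ETA shown by the progress bar.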

the conf.env is as follows:

RESUME_CHECKPOINT='latest'
DATALOADER_CONFIG='config/multidatabackend.json'
ASPECT_BUCKET_ROUNDING='2'
TRAINING_SEED='42'
USE_EMA='false'
USE_XFORMERS='false'
MINIMUM_RESOLUTION='0'
OUTPUT_DIR='output/models'
USE_DORA='false'
USE_BITFIT='false'
PUSH_TO_HUB='false'
PUSH_CHECKPOINTS='false'
MAX_NUM_STEPS='1000'
NUM_EPOCHS='0'
CHECKPOINTING_STEPS='50'
CHECKPOINTING_LIMIT='5'
DEBUG_EXTRA_ARGS=''
MODEL_TYPE='lora'
MODEL_NAME='models/FLUX.1-dev'
FLUX='true'
KOLORS='false'
STABLE_DIFFUSION_3='false'
STABLE_DIFFUSION_LEGACY='false'
FLUX_LORA_TARGET='all'
TRAIN_BATCH_SIZE='1'
USE_GRADIENT_CHECKPOINTING='true'
GRADIENT_ACCUMULATION_STEPS='2'
CAPTION_DROPOUT_PROBABILITY='0.1'
RESOLUTION_TYPE='area'
RESOLUTION='1.0'
VALIDATION_SEED='42'
VALIDATION_STEPS='50'
VALIDATION_RESOLUTION='1024x1024'
VALIDATION_GUIDANCE='7.5'
VALIDATION_GUIDANCE_RESCALE='0.0'
VALIDATION_NUM_INFERENCE_STEPS='20'
VALIDATION_PROMPT='A photo-realistic image of a cat'
ALLOW_TF32='false'
MIXED_PRECISION='bf16'
OPTIMIZER='adamw_bf16'
LEARNING_RATE='8e-5'
LR_SCHEDULE='polynomial'
LR_WARMUP_STEPS='100'
ACCELERATE_EXTRA_ARGS=''
TRAINING_NUM_PROCESSES='1'
TRAINING_NUM_MACHINES='1'
VALIDATION_TORCH_COMPILE='false'
TRAINER_DYNAMO_BACKEND='no'
TRAINER_EXTRA_ARGS='--lora_rank=64 --lr_end=1e-8 --compress_disk_cache'

and the multidatabackend is here

  [
    {
      "id": "pseudo-camera-10k-flux",
      "type": "local",
      "crop": true,
      "crop_aspect": "square",
      "crop_style": "center",
      "resolution": 512,
      "minimum_image_size": 512,
      "maximum_image_size": 512,
      "target_downsample_size": 512,
      "resolution_type": "pixel",
      "cache_dir_vae": "cache/vae/flux/pseudo-camera-10k",
      "instance_data_dir": "datasets/pseudo-camera-10k",
      "disabled": false,
      "skip_file_discovery": "",
      "caption_strategy": "filename",
      "metadata_backend": "json"
    },
    {
      "id": "text-embeds",
      "type": "local",
      "dataset_type": "text_embeds",
      "default": true,
      "cache_dir": "cache/text/flux/pseudo-camera-10k",
      "disabled": false,
      "write_batch_size": 128
    }
  ]

Aug 17 '24 10:08 bidilun

It's probably running on CPU instead of GPU.

Also, RTX 3080 doesn't have enough VRAM to train a Flux LoRA without quantization. Most RTX 3080 cards are either 10 GB or 12 GB. You would need int4 quantization to squeeze the model down to that size. Or maybe wait and see if NF4 quantization support eventually lands here.

Aug 17 '24 11:08 mhirki

Sorry, I gave the wrong number: it is a 3090 with 24 GB, so it should normally be OK for Flux. I also have a second, older Nvidia card with 8 GB in the machine, but that should not affect things.

Aug 17 '24 11:08 bidilun

So this is a multi-GPU machine? That could also be causing these issues if it's trying to use both GPUs and the slower GPU is holding back everything. You can run accelerate config to tell it to use only one GPU.
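
Besides `accelerate config`, a common way to hide the second card from a process is the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA reads when it initializes. A minimal sketch, assuming the 3090 is device 0 (check `nvidia-smi` for the real index); `single_gpu_env` is a hypothetical helper, not part of SimpleTuner:

```python
import os


def single_gpu_env(gpu_index: int, base_env=None) -> dict:
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Order devices by PCI bus ID so the index matches what nvidia-smi shows.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


env = single_gpu_env(0)  # assumption: index 0 is the 3090
print(env["CUDA_VISIBLE_DEVICES"])
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching the trainer. The variable has to be set before CUDA is initialized in that process.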

And yes, RTX 3090 is better but you still need either fp8 or int8 quantization.
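
Rough arithmetic behind this: weight memory is parameter count times bytes per parameter. The sketch below assumes roughly 12 billion parameters for the Flux transformer (a figure not stated in this thread) and ignores the text encoders, VAE, LoRA weights, optimizer state, and activations, so real usage is higher.

```python
GIB = 1024 ** 3


def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / GIB


FLUX_PARAMS = 12e9  # assumption: approximate size of the Flux transformer

for name, bits in [("bf16", 16), ("fp8/int8", 8), ("int4/nf4", 4)]:
    print(f"{name:>9}: {weight_gib(FLUX_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the bf16 weights alone come to about 22 GiB, which nearly fills a 24 GB card before anything else is loaded. Halving that with 8-bit weights leaves room on a 3090, and a 10 to 12 GB card would need 4-bit.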

Aug 17 '24 11:08 mhirki

And since you don't always have internet access, you should probably run wandb offline. Or alternatively, just switch to tensorboard which works locally.
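
For a machine without reliable internet access, these tools can be switched to offline mode through environment variables. A sketch, assuming the standard variables read by wandb (`WANDB_MODE`) and by the Hugging Face libraries (`HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE`). The helper name is hypothetical, and whether SimpleTuner needs anything further is not confirmed here:

```python
import os


def offline_env(base_env=None) -> dict:
    """Return a copy of the environment with network-dependent tools set offline."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "offline"       # wandb logs locally, to be synced later
    env["HF_HUB_OFFLINE"] = "1"         # huggingface_hub makes no HTTP calls
    env["TRANSFORMERS_OFFLINE"] = "1"   # transformers uses only local files
    return env


env = offline_env()
print({key: env[key] for key in ("WANDB_MODE", "HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")})
```

With the Hugging Face variables set, models have to be available locally already, as with the `models/FLUX.1-dev` path in the config above.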

Aug 17 '24 11:08 mhirki

I'm not using wandb or tensorboard, but maybe I will try tensorboard if I manage to get past this embed pre-computation. During that step it only uses one CPU core at 100% and does not do much on the GPU.

Aug 17 '24 15:08 bidilun

One CPU core at 100% is normal. GPU should be busy when pre-computing the text embeds. There's probably something wrong with your system specifically.
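
One way to confirm whether the GPU is doing any work is to poll `nvidia-smi` while the embeds are being computed. A minimal sketch that parses its CSV query output; the function names and dict keys are illustrative only:

```python
import csv
import io
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field: str):
    """Convert a numeric field, returning None for values such as '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_rows(text: str) -> list:
    """Parse nvidia-smi CSV rows into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or malformed lines
        rows.append({
            "index": _to_int(rec[0]),
            "name": rec[1].strip(),
            "util_percent": _to_int(rec[2]),
            "memory_used_mib": _to_int(rec[3]),
        })
    return rows


if __name__ == "__main__" and shutil.which("nvidia-smi"):
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode == 0:
        for gpu in parse_gpu_rows(result.stdout):
            print(gpu)
```

If utilization stays near 0% and memory use stays low on every card while the embeds are being written, the text encoders are most likely running on the CPU.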

SimpleTuner is using CUDA 12.4 so the minimum Linux driver version is 550.54.14. https://docs.nvidia.com/cuda/archive/12.4.0/cuda-toolkit-release-notes/
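
To compare an installed driver against that minimum, the version has to be compared numerically, component by component, not as a string. A small sketch; the installed version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and the helper names are hypothetical:

```python
def parse_version(text: str) -> tuple:
    """Turn '550.54.14' into (550, 54, 14) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


MIN_DRIVER = parse_version("550.54.14")  # minimum quoted above for CUDA 12.4


def driver_is_new_enough(installed: str, minimum: tuple = MIN_DRIVER) -> bool:
    """True if the installed driver version is at least the minimum."""
    return parse_version(installed) >= minimum


print(driver_is_new_enough("535.129.03"))  # False
print(driver_is_new_enough("550.54.14"))   # True
```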

Aug 17 '24 16:08 mhirki