
FluxGym - freezes after a few seconds (with no GPU activity)

Open mariano-mena opened this issue 9 months ago • 3 comments

Package

FluxGym , commit: main@284ddc7

When did the issue occur?

Running the Package

What GPU / hardware type are you using?

RTX 4070ti Super 16Gb

What happened?

When I press "Train", after all the preparation (including the VAE, SFT, and CLIP files), a few lines are executed in the console and then the program hangs forever, with no GPU activity showing in Task Manager. (The Forge package works fine.)

Also, when launching FluxGym, the Stability Matrix console window says "The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable." (Manually installing bitsandbytes-windows did not fix the issue described in this report.)

Console output

[2025-03-11 10:23:18] [INFO] Running d:\StabilityMatrix\Packages\FluxGym\outputs\fdsf-jk-kljl\train.bat
[2025-03-11 10:23:18] [INFO]
[2025-03-11 10:23:18] [INFO] d:\StabilityMatrix\Packages\FluxGym>accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 sd-scripts/flux_train_network.py --pretrained_model_name_or_path "d:\StabilityMatrix\Packages\FluxGym\models\unet\flux1-dev.sft" --clip_l "d:\StabilityMatrix\Packages\FluxGym\models\clip\clip_l.safetensors" --t5xxl "d:\StabilityMatrix\Packages\FluxGym\models\clip\t5xxl_fp16.safetensors" --ae "d:\StabilityMatrix\Packages\FluxGym\models\vae\ae.sft" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --lr_scheduler constant_with_warmup --max_grad_norm 0.0 --sample_prompts="d:\StabilityMatrix\Packages\FluxGym\outputs\fdsf-jk-kljl\sample_prompts.txt" --sample_every_n_steps="300" --learning_rate 8e-4 --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 16 --save_every_n_epochs 4 --dataset_config "d:\StabilityMatrix\Packages\FluxGym\outputs\fdsf-jk-kljl\dataset.toml" --output_dir "d:\StabilityMatrix\Packages\FluxGym\outputs\fdsf-jk-kljl" --output_name fdsf-jk-kljl --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1 --loss_type l2
[2025-03-11 10:23:24] [INFO] 2025-03-11 10:23:24 INFO highvram is enabled / train_util.py:4292
[2025-03-11 10:23:24] [INFO] highvramが有効です
[2025-03-11 10:23:24] [INFO] WARNING cache_latents_to_disk is train_util.py:4309
[2025-03-11 10:23:24] [INFO] enabled, so cache_latents is
[2025-03-11 10:23:24] [INFO] also enabled /
[2025-03-11 10:23:24] [INFO] cache_latents_to_diskが有効なた
[2025-03-11 10:23:24] [INFO] め、cache_latentsを有効にします
[2025-03-11 10:23:24] [INFO] 2025-03-11 10:23:24 INFO Checking the state dict: flux_utils.py:43
[2025-03-11 10:23:24] [INFO] Diffusers or BFL, dev or schnell
[2025-03-11 10:23:24] [INFO] INFO t5xxl_max_token_length: flux_train_network.py:157
[2025-03-11 10:23:24] [INFO] 512
[2025-03-11 10:23:24] [INFO] d:\StabilityMatrix\Packages\FluxGym\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
[2025-03-11 10:23:24] [INFO] warnings.warn(
[2025-03-11 10:23:24] [INFO] You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[2025-03-11 10:23:24] [INFO] INFO Loading dataset config from train_network.py:488
[2025-03-11 10:23:24] [INFO] d:\StabilityMatrix\Packages\F
[2025-03-11 10:23:24] [INFO] luxGym\outputs\fdsf-jk-kljl\d
[2025-03-11 10:23:24] [INFO] ataset.toml
[2025-03-11 10:23:24] [INFO] INFO prepare images. train_util.py:2049
[2025-03-11 10:23:24] [INFO] INFO get image size from name of train_util.py:1942
[2025-03-11 10:23:24] [INFO] cache files
[2025-03-11 10:23:24] [INFO] 0%| | 0/5 [00:00<?, ?it/s] 100%|██████████| 5/5 [00:00<?, ?it/s]
[2025-03-11 10:23:24] [INFO] INFO set image size from cache train_util.py:1972
[2025-03-11 10:23:24] [INFO] files: 0/5
[2025-03-11 10:23:24] [INFO] INFO found directory train_util.py:1996
[2025-03-11 10:23:24] [INFO] d:\StabilityMatrix\Packages\Flu
[2025-03-11 10:23:24] [INFO] xGym\datasets\fdsf-jk-kljl
[2025-03-11 10:23:24] [INFO] contains 5 image files
[2025-03-11 10:23:24] [INFO] read caption: 0%| | 0/5 [00:00<?, ?it/s] read caption: 100%|██████████| 5/5 [00:00<00:00, 755.68it/s]
[2025-03-11 10:23:24] [INFO] INFO 50 train images with repeats. train_util.py:2092
[2025-03-11 10:23:24] [INFO] INFO 0 reg images with repeats. train_util.py:2096
[2025-03-11 10:23:24] [INFO] WARNING no regularization images / train_util.py:2101
[2025-03-11 10:23:24] [INFO] 正則化画像が見つかりませんでし
[2025-03-11 10:23:24] [INFO] た
[2025-03-11 10:23:24] [INFO] INFO [Dataset 0] config_util.py:575
[2025-03-11 10:23:24] [INFO] batch_size: 1
[2025-03-11 10:23:24] [INFO] resolution: (512, 512)
[2025-03-11 10:23:24] [INFO] enable_bucket: False
[2025-03-11 10:23:24] [INFO]
[2025-03-11 10:23:24] [INFO] [Subset 0 of Dataset 0]
[2025-03-11 10:23:24] [INFO] image_dir:
[2025-03-11 10:23:24] [INFO] "d:\StabilityMatrix\Packages\Fl
[2025-03-11 10:23:24] [INFO] uxGym\datasets\fdsf-jk-kljl"
[2025-03-11 10:23:24] [INFO] image_count: 5
[2025-03-11 10:23:24] [INFO] num_repeats: 10
[2025-03-11 10:23:24] [INFO] shuffle_caption: False
[2025-03-11 10:23:24] [INFO] keep_tokens: 1
[2025-03-11 10:23:24] [INFO] caption_dropout_rate: 0.0
[2025-03-11 10:23:24] [INFO] caption_dropout_every_n_epo
[2025-03-11 10:23:24] [INFO] chs: 0
[2025-03-11 10:23:24] [INFO] caption_tag_dropout_rate:
[2025-03-11 10:23:24] [INFO] 0.0
[2025-03-11 10:23:24] [INFO] caption_prefix: None
[2025-03-11 10:23:24] [INFO] caption_suffix: None
[2025-03-11 10:23:24] [INFO] color_aug: False
[2025-03-11 10:23:24] [INFO] flip_aug: False
[2025-03-11 10:23:24] [INFO] face_crop_aug_range: None
[2025-03-11 10:23:24] [INFO] random_crop: False
[2025-03-11 10:23:24] [INFO] token_warmup_min: 1,
[2025-03-11 10:23:24] [INFO] token_warmup_step: 0,
[2025-03-11 10:23:24] [INFO] alpha_mask: False
[2025-03-11 10:23:24] [INFO] custom_attributes: {}
[2025-03-11 10:23:24] [INFO] is_reg: False
[2025-03-11 10:23:24] [INFO] class_tokens:
[2025-03-11 10:23:24] [INFO] sdfjklsdkfjsdlkf
[2025-03-11 10:23:24] [INFO] caption_extension: .txt
[2025-03-11 10:23:24] [INFO]
[2025-03-11 10:23:24] [INFO]
[2025-03-11 10:23:24] [INFO] INFO [Prepare dataset 0] config_util.py:587
[2025-03-11 10:23:24] [INFO] INFO loading image sizes. train_util.py:970
[2025-03-11 10:23:24] [INFO] 0%| | 0/5 [00:00<?, ?it/s] 100%|██████████| 5/5 [00:00<00:00, 333.63it/s]
[2025-03-11 10:23:24] [INFO] INFO prepare dataset train_util.py:995
[2025-03-11 10:23:24] [INFO] INFO preparing accelerator train_network.py:562
[2025-03-11 10:23:24] [INFO] accelerator device: cpu
[2025-03-11 10:23:24] [INFO] INFO Checking the state dict: flux_utils.py:43
[2025-03-11 10:23:24] [INFO] Diffusers or BFL, dev or schnell
[2025-03-11 10:23:24] [INFO] INFO Building Flux model dev from BFL flux_utils.py:101
[2025-03-11 10:23:24] [INFO] checkpoint
[2025-03-11 10:23:24] [INFO] INFO Loading state dict from flux_utils.py:118
[2025-03-11 10:23:24] [INFO] d:\StabilityMatrix\Packages\Flux
[2025-03-11 10:23:24] [INFO] Gym\models\unet\flux1-dev.sft
[2025-03-11 10:23:24] [INFO] INFO Loaded Flux: <All keys matched flux_utils.py:137
[2025-03-11 10:23:24] [INFO] successfully>
[2025-03-11 10:23:24] [INFO] INFO Cast FLUX model to fp8. flux_train_network.py:108
[2025-03-11 10:23:24] [INFO] This may take a while.
[2025-03-11 10:23:24] [INFO] You can reduce the time
[2025-03-11 10:23:24] [INFO] by using fp8 checkpoint.
[2025-03-11 10:23:24] [INFO] /
[2025-03-11 10:23:24] [INFO] FLUXモデルをfp8に変換し
[2025-03-11 10:23:24] [INFO] ています。これには時間が
[2025-03-11 10:23:24] [INFO] かかる場合があります。fp
[2025-03-11 10:23:24] [INFO] 8チェックポイントを使用
[2025-03-11 10:23:24] [INFO] することで時間を短縮でき
[2025-03-11 10:23:24] [INFO] ます。
[2025-03-11 10:24:05] [INFO] 2025-03-11 10:24:05 INFO Building CLIP-L flux_utils.py:179
[2025-03-11 10:24:05] [INFO] INFO Loading state dict from flux_utils.py:275
[2025-03-11 10:24:05] [INFO] d:\StabilityMatrix\Packages\Flux
[2025-03-11 10:24:05] [INFO] Gym\models\clip\clip_l.safetenso
[2025-03-11 10:24:05] [INFO] rs
[2025-03-11 10:24:06] [INFO] 2025-03-11 10:24:06 INFO Loaded CLIP-L: <All keys matched flux_utils.py:278
[2025-03-11 10:24:06] [INFO] successfully>
[2025-03-11 10:24:06] [INFO] INFO Loading state dict from flux_utils.py:330
[2025-03-11 10:24:06] [INFO] d:\StabilityMatrix\Packages\Flux
[2025-03-11 10:24:06] [INFO] Gym\models\clip\t5xxl_fp16.safet
[2025-03-11 10:24:06] [INFO] ensors
[2025-03-11 10:24:06] [INFO] INFO Loaded T5xxl: <All keys matched flux_utils.py:333
[2025-03-11 10:24:06] [INFO] successfully>
[2025-03-11 10:24:06] [INFO] INFO Building AutoEncoder flux_utils.py:144
[2025-03-11 10:24:06] [INFO] INFO Loading state dict from flux_utils.py:149
[2025-03-11 10:24:06] [INFO] d:\StabilityMatrix\Packages\Flux
[2025-03-11 10:24:06] [INFO] Gym\models\vae\ae.sft
[2025-03-11 10:24:06] [INFO] INFO Loaded AE: <All keys matched flux_utils.py:152
[2025-03-11 10:24:06] [INFO] successfully>
[2025-03-11 10:24:06] [INFO] import network module: networks.lora_flux
[2025-03-11 10:24:06] [INFO] INFO [Dataset 0] train_util.py:2585
[2025-03-11 10:24:06] [INFO] INFO caching latents with caching train_util.py:1095
[2025-03-11 10:24:06] [INFO] strategy.
[2025-03-11 10:24:06] [INFO] INFO caching latents... train_util.py:1144

Version

v.2.13.4

What Operating System are you using?

Windows

mariano-mena avatar Mar 11 '25 13:03 mariano-mena

Click the 3 dots next to FluxGym in the launcher, then go to "Python Packages". Search for "bitsandbytes", select 0.45.3 from the dropdown, and press the blue icon to update. Now run it again; it should work.
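If you prefer a terminal, a roughly equivalent approach (untested here, and assuming the FluxGym package lives at the path shown in the log above) is to activate the package venv and pin the version with pip:

cd d:\StabilityMatrix\Packages\FluxGym
.\venv\Scripts\activate.bat
pip install bitsandbytes==0.45.3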

EpicMulletGuy avatar Mar 12 '25 16:03 EpicMulletGuy

Click the 3 dots next to FluxGym in the launcher, then go to "Python Packages". Search for "bitsandbytes", select 0.45.3 from the dropdown, and press the blue icon to update. Now run it again; it should work.

Thank you very much, Carlos. Updating the bitsandbytes version does indeed make the "no GPU support" warning disappear, but that's something I had already achieved another way, and it definitely doesn't fix the main problem: the software freezes a few seconds after starting the training. Watching Task Manager, I could see that it freezes at the moment it transfers data from RAM to VRAM. Incidentally, this whole process works fine in FluxGym running under Pinokio on the same PC, but I'd like to be able to do everything in Stability Matrix.

mariano-mena avatar Mar 13 '25 00:03 mariano-mena

Same problem here: 3060 12GB, fresh StabilityMatrix installation from a couple of days ago. By the way, the flux1-dev model is also not downloaded as expected (.sft extension by default?).

user6927 avatar Apr 05 '25 12:04 user6927

Same problem

go to "python packages". Search for "bitsandbytes" and select 0.45.3

Unfortunately it doesn't work. I tried the default 0.44.0, 0.45.3, and the latest 0.45.5.

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable

LoRA training stops at "caching latents". The problem appeared after building a new PC; the only thing left from the old PC is the video card, a 4070.

foxhoundunit avatar May 06 '25 09:05 foxhoundunit

FINALLY Solved

MY SOLUTION (requires basic knowledge of the Windows shell)

  1. Install PyTorch with CUDA support via Command Prompt

1.1 Activate the virtual environment:
cd D:\StabilityMatrix\Packages\FluxGym
.\venv\Scripts\activate.bat

1.2 Uninstall the CPU-only version of PyTorch: pip uninstall -y torch torchvision torchaudio

1.3 Install the official CUDA-enabled build (example for CUDA 12.1): pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

1.4 Verify in the Python REPL (launch python and type the following; a one-command version of this check is shown after these steps):
import torch
torch.cuda.is_available() → should return True
torch.cuda.device_count() → should return 1 (or more)

  2. Relaunch FluxGym

Done! With this fix, FluxGym inside Stability Matrix runs correctly again.
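For reference, the verification in step 1.4 can also be run non-interactively from the activated venv as a single command (a minimal sketch; the exact version string and device count depend on your setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"

With a CUDA build installed from the cu121 index, this should print a version tag ending in +cu121 followed by True and a device count of at least 1.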

I'm posting this because I couldn't find an effective solution documented anywhere else; none of the other suggestions I came across over many months actually worked. If this solution worked for you, please share in the comments!

Long live Stability Matrix!!! Mariano from Buenos Aires

mariano-mena avatar Jun 22 '25 05:06 mariano-mena

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, else this will be closed in 7 days.

github-actions[bot] avatar Aug 22 '25 02:08 github-actions[bot]

This issue was closed because it has been stale for 7 days with no activity.

github-actions[bot] avatar Aug 30 '25 02:08 github-actions[bot]