Training on 2x H100 on Ubuntu, but the speed is the same as on 1x H100. What are we doing wrong?
When training with batch size 4 on 1x H100, the speed is 1.27 s/it.
When training with batch size 4 on 2x H100, the speed is 2.05 s/it.
So we basically got almost no speed boost from multi-GPU training.
Is this expected? I am training the SDXL RealVis XL model at 1024x1024 with no bucketing.
We are using the latest bmaltais Kohya GUI on Ubuntu with the multi-GPU configuration below.
@kohya-ss @bmaltais
Below is the training JSON config:
{
"adaptive_noise_scale": 0,
"additional_parameters": "--max_grad_norm=0.0 --no_half_vae --train_text_encoder",
"async_upload": false,
"bucket_no_upscale": true,
"bucket_reso_steps": 64,
"cache_latents": true,
"cache_latents_to_disk": true,
"caption_dropout_every_n_epochs": 0,
"caption_dropout_rate": 0,
"caption_extension": "",
"clip_skip": 1,
"color_aug": false,
"dataset_config": "",
"debiased_estimation_loss": false,
"dynamo_backend": "no",
"dynamo_mode": "default",
"dynamo_use_dynamic": false,
"dynamo_use_fullgraph": false,
"enable_bucket": false,
"epoch": 50,
"extra_accelerate_launch_args": "",
"flip_aug": false,
"full_bf16": true,
"full_fp16": false,
"gpu_ids": "1,2",
"gradient_accumulation_steps": 1,
"gradient_checkpointing": false,
"huber_c": 0.1,
"huber_schedule": "snr",
"huggingface_path_in_repo": "",
"huggingface_repo_id": "",
"huggingface_repo_type": "",
"huggingface_repo_visibility": "",
"huggingface_token": "",
"ip_noise_gamma": 0,
"ip_noise_gamma_random_strength": false,
"keep_tokens": 0,
"learning_rate": 8e-06,
"learning_rate_te": 1e-05,
"learning_rate_te1": 3e-06,
"learning_rate_te2": 0,
"log_tracker_config": "",
"log_tracker_name": "",
"log_with": "",
"logging_dir": "",
"loss_type": "l2",
"lr_scheduler": "constant",
"lr_scheduler_args": "",
"lr_scheduler_num_cycles": 1,
"lr_scheduler_power": 1,
"lr_warmup": 0,
"main_process_port": 0,
"masked_loss": false,
"max_bucket_reso": 2048,
"max_data_loader_n_workers": 0,
"max_resolution": "1024,1024",
"max_timestep": 1000,
"max_token_length": 75,
"max_train_epochs": 0,
"max_train_steps": 0,
"mem_eff_attn": false,
"metadata_author": "",
"metadata_description": "",
"metadata_license": "",
"metadata_tags": "",
"metadata_title": "",
"min_bucket_reso": 256,
"min_snr_gamma": 0,
"min_timestep": 0,
"mixed_precision": "bf16",
"model_list": "custom",
"multi_gpu": true,
"multires_noise_discount": 0,
"multires_noise_iterations": 0,
"no_token_padding": false,
"noise_offset": 0,
"noise_offset_random_strength": false,
"noise_offset_type": "Original",
"num_cpu_threads_per_process": 4,
"num_machines": 1,
"num_processes": 2,
"optimizer": "Adafactor",
"optimizer_args": "scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01",
"output_dir": "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion",
"output_name": "shoes_test_2",
"persistent_data_loader_workers": false,
"pretrained_model_name_or_path": "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/RealVisXL_V4.0.safetensors",
"prior_loss_weight": 1,
"random_crop": false,
"reg_data_dir": "",
"resume": "",
"resume_from_huggingface": "",
"sample_every_n_epochs": 0,
"sample_every_n_steps": 0,
"sample_prompts": "",
"sample_sampler": "euler_a",
"save_every_n_epochs": 10,
"save_every_n_steps": 0,
"save_last_n_steps": 0,
"save_last_n_steps_state": 0,
"save_model_as": "safetensors",
"save_precision": "bf16",
"save_state": false,
"save_state_on_train_end": false,
"save_state_to_huggingface": false,
"scale_v_pred_loss_like_noise_pred": false,
"sdxl": true,
"seed": 0,
"shuffle_caption": false,
"stop_text_encoder_training": 0,
"train_batch_size": 4,
"train_data_dir": "/home/Ubuntu/Desktop/shoes_train_datasets/test1/img",
"v2": false,
"v_parameterization": false,
"v_pred_like_loss": 0,
"vae": "stabilityai/sdxl-vae",
"vae_batch_size": 8,
"wandb_api_key": "",
"wandb_run_name": "",
"weighted_captions": false,
"xformers": "xformers"
}
Could you provide a copy of the toml? It is what sd-scripts ultimately consumes, and it should make it easier for @kohya-ss to troubleshoot without being concerned with the GUI config.
Many users have been complaining about issues with multiple GPUs, so I am curious to learn whether it is something I am doing wrong in the GUI, like not handling parameters properly or not allowing needed parameters to be entered.
Here it is:
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
clip_skip = 1
dynamo_backend = "no"
epoch = 50
full_bf16 = true
gradient_accumulation_steps = 1
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 8e-6
learning_rate_te1 = 3e-6
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_timestep = 1000
max_token_length = 75
max_train_steps = 1175
min_bucket_reso = 256
mixed_precision = "bf16"
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
optimizer_type = "Adafactor"
output_dir = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion"
output_name = "shoes_test_2"
pretrained_model_name_or_path = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/RealVisXL_V4.0.safetensors"
prior_loss_weight = 1
resolution = "1024,1024"
sample_prompts = "/home/Ubuntu/apps/stable-diffusion-webui/models/Stable-diffusion/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
train_batch_size = 4
train_data_dir = "/home/Ubuntu/Desktop/shoes_train_datasets/test1/img"
vae = "stabilityai/sdxl-vae"
vae_batch_size = 8
xformers = true
@aria1th @BootsofLagrangian any ideas?
AFAIK the batch size is per device, so the effective batch size is 4 x 2 = 8, which is why the time per step goes up instead of down. To keep the same global batch size you would divide the per-device batch size by the number of devices. But 4 is a ridiculously small batch size considering you're using H100s, and most of your time is being wasted on communication overhead between the cards. You should be raising the batch size way up.
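For what it's worth, here is a quick back-of-the-envelope check of the numbers in the first post, assuming the batch size really is per device: the dual-GPU run still processes more images per second overall, just far less than 2x.

# Throughput implied by the reported timings, assuming batch size is per device.
single = 4 / 1.27        # 1x H100, batch 4, 1.27 s/it  -> ~3.15 images/s
dual = (4 * 2) / 2.05    # 2x H100, batch 4 each, 2.05 s/it -> ~3.90 images/s
print(f"1x H100: {single:.2f} img/s, 2x H100: {dual:.2f} img/s")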
I know it is. Each GPU could go up to a maximum batch size of 7 in my tests. It still wouldn't make a difference, since the communication overhead is just crazy. Before this new multi-GPU training system it was way faster: I was doing dual T4 GPU training on Kaggle and there was almost no such communication delay. Moreover, with the new system I could never make it work on Kaggle either.
Some performance degradation due to communication overhead is expected; it's normal. It is bottlenecked more by the system hardware itself, which is why everyone is chasing systems with less communication bottleneck, B100 / B200 / etc., as NVIDIA says. Batch size makes a drastic difference, yes, so you should make it as high as your card can handle. But if your system is flawed, like an H100 backed by NFS storage (wtf?) or a system with poor bandwidth, then you can't get any advantage from it.
GCP has always known that hardware is what matters most, so you would never hit a bottleneck there, but if you're using another service provider you should check those factors...
But if it's 'version dependent' then, uhh... the kohya script does not handle communication, accelerate does.
@aria1th this was on the same machine, rented on Massed Compute.
What hardware do I have to check? This speed loss is just huge. Maybe I am doing something wrong?
Mainboard, storage, RAM, CPU... a bottleneck can come from various causes, and you have to check them all first.
I doubt any of them is the cause. You get a very powerful VM, and the single-GPU speed looks about right.
But let's say one of them is the cause: how do I debug it?
Have you recently tried the version that used to work fine on the same system? Is it possible the hosting provider has changed the type of machine, and that is causing this issue?
If the speed is back up, then you could tell kohya which sd-scripts code base used to work best, and he might be able to pinpoint where the speed issue is coming from.
It was a very long time ago that I last used dual GPUs successfully, about 7 months ago; I have a video :D I can try, maybe.
Do your H100s connect via NVLink, or just PCIe? If it is PCIe, speed degradation occurs due to the PCIe communication bottleneck.
Just asked them, let's see what they say. Can we check it on the machine somehow, with a command etc.?
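For reference, a minimal way to check this from the machine itself, assuming the NVIDIA driver's nvidia-smi tool is on PATH: NV# entries in the topology matrix indicate NVLink, while PIX/PHB/SYS indicate PCIe paths. The Python wrapper below is just an illustrative sketch.

# Sketch: inspect how the GPUs are interconnected (NVLink vs. PCIe).
import subprocess
import torch

# GPU-to-GPU topology matrix from the driver (NV# = NVLink, PIX/PHB/SYS = PCIe).
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)

# Whether PyTorch can use direct peer-to-peer copies between GPU 0 and GPU 1.
if torch.cuda.device_count() >= 2:
    print("P2P between GPU 0 and 1:", torch.cuda.can_device_access_peer(0, 1))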
OK, it turns out they are all PCIe. So I assume we can't get any better, right?
Okay, so there is a hardware bottleneck. But I think you can still get a faster total training time using two H100s, even if not a faster time per step.
i.e. one H100 at 1.27 s/it would need about 2 x 1.27 = 2.54 s to get through a batch of 8 images, while two DDP H100s get through a batch of 8 in about 2.07 s, so the pair is still faster overall.
If you have the budget for NVLink, it is the fastest way to speed up your H100s. If you don't want to buy it, XD
Also, the speed degradation due to communication is not your fault. The H100 simply has far higher memory bandwidth than PCIe, e.g. H100 HBM at about 2 TB/s vs. PCIe 4.0 x16 at about 32 GB/s.
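If you want to measure the communication cost directly rather than guess, a rough micro-benchmark is to time an all-reduce the size of the gradients: the training log further down in this thread reports about 2.69B trainable parameters, i.e. roughly 5.4 GB per step in bf16. This is only a sketch under those assumptions (recent PyTorch/NCCL, a hypothetical file named allreduce_bench.py), not something from sd-scripts.

# allreduce_bench.py -- hypothetical micro-benchmark, not part of sd-scripts.
# Launch with: torchrun --nproc_per_node=2 allreduce_bench.py
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # ~2.69B bf16 values, matching the "number of trainable parameters"
    # reported by sdxl_train.py for U-Net + text encoder 1 (~5.4 GB).
    grads = torch.ones(2_690_524_164, dtype=torch.bfloat16, device="cuda")

    for _ in range(3):                     # warm-up
        dist.all_reduce(grads)
    torch.cuda.synchronize()

    iters = 5
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(grads)
    torch.cuda.synchronize()
    per_iter = (time.time() - start) / iters

    if rank == 0:
        size_gb = grads.numel() * grads.element_size() / 1e9
        print(f"all_reduce of {size_gb:.1f} GB took {per_iter:.2f} s per call")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If that single call takes on the order of a few hundred milliseconds over PCIe, it would account for much of the gap between 1.27 s/it and 2.05 s/it.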
@BootsofLagrangian it is not like I purchased them, I am just renting on Massed Compute :)
They said they have SXM4 A100s. I will test the script there. It is supposed not to get degraded speed like this. We will see :)
Most SXM4 systems run on an interconnect (NVLink, NVSwitch), so no degradation is natural there, but most PCIe systems do not. PCIe-powered GPUs need an external interconnect device.
Started a machine, will try to test now.
@kohya-ss the training fails on an SXM4 machine :(
When 1 GPU is used, it works.
Here is the batch size 7 speed.
When I try 2 GPUs as below, it fails.
I tested all of the dynamo backends; all failed.
00:43:55-975677 INFO Start training Dreambooth...
00:43:55-976776 INFO Validating lr scheduler arguments...
00:43:55-977355 INFO Validating optimizer arguments...
00:43:55-977896 INFO Validating /home/Ubuntu/Desktop/results existence and
writability... SUCCESS
00:43:55-978494 INFO Validating
/home/Ubuntu/Downloads/RealVisXL_V4.0.safetensors
existence... SUCCESS
00:43:55-979055 INFO Validating /home/Ubuntu/Desktop/train_imgs existence...
SUCCESS
00:43:55-979627 INFO Validating stabilityai/sdxl-vae existence... SKIPPING:
huggingface.co model
00:43:55-980219 INFO Folder 1_ohwx man: 1 repeats found
00:43:55-981209 INFO Folder 1_ohwx man: 480 images found
00:43:55-981777 INFO Folder 1_ohwx man: 480 * 1 = 480 steps
00:43:55-982305 INFO Regulatization factor: 1
00:43:55-982809 INFO Total steps: 480
00:43:55-983280 INFO Train batch size: 7
00:43:55-983730 INFO Gradient accumulation steps: 1
00:43:55-984195 INFO Epoch: 400
00:43:55-984662 INFO max_train_steps (480 / 7 / 1 * 400 * 1) = 27429
00:43:55-985243 INFO lr_warmup_steps = 0
00:43:55-986084 INFO Saving training config to
/home/Ubuntu/Desktop/results/2_gpu_20240723-004355.json
...
00:43:55-986976 INFO Executing command:
/home/Ubuntu/Desktop/kohya_ss/venv/bin/accelerate
launch --dynamo_backend no --dynamo_mode default
--gpu_ids 0,1 --mixed_precision bf16 --multi_gpu
--num_processes 2 --num_machines 1
--num_cpu_threads_per_process 4
/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py
--config_file
/home/Ubuntu/Desktop/results/config_dreambooth-20240723
-004355.toml --max_grad_norm=0.0 --no_half_vae
--train_text_encoder --learning_rate_te2=0
00:43:55-988822 INFO Command executed.
2024-07-23 00:44:03.295128: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-23 00:44:03.295172: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-23 00:44:03.296055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-23 00:44:03.301057: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 00:44:03.364694: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-23 00:44:03.364751: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-23 00:44:03.367112: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-23 00:44:03.374010: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-23 00:44:03.932149: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-23 00:44:04.062733: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-07-23 00:44:04 INFO Loading settings from train_util.py:3744
/home/Ubuntu/Desktop/results/con
fig_dreambooth-20240723-004355.t
oml...
INFO /home/Ubuntu/Desktop/results/con train_util.py:3763
fig_dreambooth-20240723-004355
WARNING clip_skip will be unexpected sdxl_train_util.py:343
/
SDXL学習ではclip_skipは動作
しません
2024-07-23 00:44:04 INFO prepare tokenizers sdxl_train_util.py:134
2024-07-23 00:44:04 INFO Loading settings from train_util.py:3744
/home/Ubuntu/Desktop/results/con
fig_dreambooth-20240723-004355.t
oml...
INFO /home/Ubuntu/Desktop/results/con train_util.py:3763
fig_dreambooth-20240723-004355
WARNING clip_skip will be unexpected sdxl_train_util.py:343
/
SDXL学習ではclip_skipは動作
しません
2024-07-23 00:44:04 INFO prepare tokenizers sdxl_train_util.py:134
INFO update token length: 75 sdxl_train_util.py:159
INFO Using DreamBooth method. sdxl_train.py:144
2024-07-23 00:44:05 INFO prepare images. train_util.py:1572
INFO found directory train_util.py:1519
/home/Ubuntu/Desktop/train_imgs/
1_ohwx man contains 480 image
files
2024-07-23 00:44:05 INFO update token length: 75 sdxl_train_util.py:159
WARNING No caption file found for 480 train_util.py:1550
images. Training will continue
without captions for these
images. If class token exists,
it will be used. /
480枚の画像にキャプションファイ
ルが見つかりませんでした。これら
の画像についてはキャプションなし
で学習を続行します。class
tokenが存在する場合はそれを使い
ます。
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(10th copy).jpg
INFO Using DreamBooth method. sdxl_train.py:144
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(11th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(12th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(13th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(14th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1555
1_ohwx man/IMG_20230430_134600
(15th copy).jpg... and 475 more
INFO 480 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / train_util.py:1621
正則化画像が見つかりませんでした
INFO [Dataset 0] config_util.py:565
batch_size: 7
resolution: (1024, 1024)
enable_bucket: False
network_multiplier: 1.0
[Subset 0 of Dataset 0]
image_dir:
"/home/Ubuntu/Desktop/train_imgs
/1_ohwx man"
image_count: 480
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
keep_tokens_separator:
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoc
hes: 0
caption_tag_dropout_rate:
0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: ohwx man
caption_extension: .caption
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 107174.12it/s]
INFO prepare dataset train_util.py:861
INFO prepare accelerator sdxl_train.py:201
accelerator device: cuda:0
INFO loading model for process 0/2 sdxl_train_util.py:30
INFO load StableDiffusion sdxl_train_util.py:70
checkpoint:
/home/Ubuntu/Downloads/RealVi
sXL_V4.0.safetensors
INFO building U-Net sdxl_model_util.py:192
INFO loading U-Net from sdxl_model_util.py:196
checkpoint
INFO prepare images. train_util.py:1572
INFO found directory train_util.py:1519
/home/Ubuntu/Desktop/train_imgs/
1_ohwx man contains 480 image
files
WARNING No caption file found for 480 train_util.py:1550
images. Training will continue
without captions for these
images. If class token exists,
it will be used. /
480枚の画像にキャプションファイ
ルが見つかりませんでした。これら
の画像についてはキャプションなし
で学習を続行します。class
tokenが存在する場合はそれを使い
ます。
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(10th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(11th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(12th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(13th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1557
1_ohwx man/IMG_20230430_134600
(14th copy).jpg
WARNING /home/Ubuntu/Desktop/train_imgs/ train_util.py:1555
1_ohwx man/IMG_20230430_134600
(15th copy).jpg... and 475 more
INFO 480 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / train_util.py:1621
正則化画像が見つかりませんでした
INFO [Dataset 0] config_util.py:565
batch_size: 7
resolution: (1024, 1024)
enable_bucket: False
network_multiplier: 1.0
[Subset 0 of Dataset 0]
image_dir:
"/home/Ubuntu/Desktop/train_imgs
/1_ohwx man"
image_count: 480
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
keep_tokens_separator:
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoc
hes: 0
caption_tag_dropout_rate:
0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: ohwx man
caption_extension: .caption
INFO [Dataset 0] config_util.py:571
INFO loading image sizes. train_util.py:853
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 103345.10it/s]
INFO prepare dataset train_util.py:861
INFO prepare accelerator sdxl_train.py:201
accelerator device: cuda:1
2024-07-23 00:44:06 INFO U-Net: <All keys matched sdxl_model_util.py:202
successfully>
INFO building text encoders sdxl_model_util.py:205
INFO loading text encoders from sdxl_model_util.py:258
checkpoint
INFO text encoder 1: <All keys sdxl_model_util.py:272
matched successfully>
INFO text encoder 2: <All keys sdxl_model_util.py:276
matched successfully>
INFO building VAE sdxl_model_util.py:279
INFO loading VAE from checkpoint sdxl_model_util.py:284
INFO VAE: <All keys matched sdxl_model_util.py:287
successfully>
INFO load VAE: stabilityai/sdxl-vae model_util.py:1268
INFO additional VAE loaded sdxl_train_util.py:128
2024-07-23 00:44:07 INFO loading model for process 1/2 sdxl_train_util.py:30
INFO load StableDiffusion sdxl_train_util.py:70
checkpoint:
/home/Ubuntu/Downloads/RealVi
sXL_V4.0.safetensors
INFO building U-Net sdxl_model_util.py:192
INFO loading U-Net from sdxl_model_util.py:196
checkpoint
2024-07-23 00:44:08 INFO U-Net: <All keys matched sdxl_model_util.py:202
successfully>
INFO building text encoders sdxl_model_util.py:205
INFO loading text encoders from sdxl_model_util.py:258
checkpoint
2024-07-23 00:44:09 INFO text encoder 1: <All keys sdxl_model_util.py:272
matched successfully>
INFO text encoder 2: <All keys sdxl_model_util.py:276
matched successfully>
INFO building VAE sdxl_model_util.py:279
INFO loading VAE from checkpoint sdxl_model_util.py:284
INFO VAE: <All keys matched sdxl_model_util.py:287
successfully>
INFO load VAE: stabilityai/sdxl-vae model_util.py:1268
INFO additional VAE loaded sdxl_train_util.py:128
Disable Diffusers' xformers
INFO Enable xformers for U-Net train_util.py:2660
2024-07-23 00:44:09 INFO Enable xformers for U-Net train_util.py:2660
INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|█████████████████████████████████████| 480/480 [00:00<00:00, 946083.61it/s]
INFO [Dataset 0] train_util.py:2079
INFO caching latents. train_util.py:974
INFO checking cache validity... train_util.py:984
100%|███████████████████████████████████████| 480/480 [00:00<00:00, 2265.38it/s]
2024-07-23 00:44:10 INFO caching latents... train_util.py:1021
0it [00:00, ?it/s]
enable text encoder training
2024-07-23 00:44:10 INFO use Adafactor optimizer | train_util.py:4047
{'scale_parameter': False,
'relative_step': False,
'warmup_init': False,
'weight_decay': 0.01}
WARNING constant_with_warmup will be train_util.py:4079
good /
スケジューラはconstant_with_warm
upが良いかもしれません
train unet: True, text_encoder1: True, text_encoder2: False
number of models: 2
number of trainable parameters: 2690524164
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | train_util.py:4047
{'scale_parameter': False,
'relative_step': False,
'warmup_init': False,
'weight_decay': 0.01}
WARNING constant_with_warmup will be train_util.py:4079
good /
スケジューラはconstant_with_warm
upが良いかもしれません
enable full bf16 training.
running training / 学習開始
num examples / サンプル数: 480
num batches per epoch / 1epochのバッチ数: 35
num epochs / epoch数: 784
batch size per device / バッチサイズ: 7
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 27429
steps: 0%| | 0/27429 [00:00<?, ?it/s]
epoch 1/784
Traceback (most recent call last):
File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py", line 818, in <module>
train(args)
File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py", line 591, in train
noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 680, in forward
return model_forward(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 668, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/library/sdxl_original_unet.py", line 1079, in forward
t_emb = get_timestep_embedding(timesteps, self.model_channels, downscale_freq_shift=0) # , repeat_only=False)
File "/home/Ubuntu/Desktop/kohya_ss/sd-scripts/library/sdxl_original_unet.py", line 257, in get_timestep_embedding
exponent = exponent / (half_dim - downscale_freq_shift)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a71e4e84617 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a71e4e3f98d in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a71e4f35c38 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7a717413c8b0 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7a71741406d8 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7a7174156f70 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7a7174157278 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7a71e44dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7a7210094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7a7210126850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7a71e4e84617 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7a71e4e3f98d in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7a71e4f35c38 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7a717413c8b0 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7a71741406d8 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7a7174156f70 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7a7174157278 in /home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7a71e44dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7a7210094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7a7210126850 in /lib/x86_64-linux-gnu/libc.so.6)
[2024-07-23 00:44:15,053] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 14854 closing signal SIGTERM
[2024-07-23 00:44:15,618] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 14855) of binary: /home/Ubuntu/Desktop/kohya_ss/venv/bin/python
Traceback (most recent call last):
File "/home/Ubuntu/Desktop/kohya_ss/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/Ubuntu/Desktop/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/Ubuntu/Desktop/kohya_ss/sd-scripts/sdxl_train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-23_00:44:15
host : 0229-dsm-prxmx30035
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 14855)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 14855
============================================================
00:44:17-252060 INFO Training has ended.
I've been training multi-GPU for months using both the GUI and the CLI. I think this issue might be related to the CUDA version itself more than to kohya. I've had this happen to me once in the past, where it couldn't register some specific CUDA services. I mostly use RunPod and I don't have any issues whether it's H100 NVL, PCIe, or SXM. Most of the time I train with 6x L40S, since that is faster, cheaper, and has more memory than 3x H100. What I'd like to know is how to enable sparsity, since sparsity doubles the performance of FP operations.
The lack of FlashAttention 3 is rearing its ugly head; we don't even have TMA for the H100 in kohya, among other things.
Multi-GPU training worked on the PCIe machine on Massed Compute, but with SXM I got the above error. Do you know how to fix it? How do you set up your accelerator?
What CUDA version do you have on your SXM machine?
@bmaltais there is nothing wrong with your interface or kohya's script, you've done a great job, although some of the descriptions in there are not totally accurate; but that's not your fault.
CUDA error: uncorrectable ECC error encountered
This is a hardware error. You should contact the compute provider, because you've got a faulty node.
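If it helps when reporting it to the provider, a quick way to confirm which card is faulty, assuming nvidia-smi is available, is to dump the ECC counters and look for non-zero uncorrectable counts on one of the GPUs. A minimal sketch:

# Sketch: dump per-GPU ECC error counters to identify the faulty card.
import subprocess

subprocess.run(["nvidia-smi", "-q", "-d", "ECC"], check=True)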
Thanks, I did. It could be the reason.