About dataloader_num_workers in train_text_to_image_lora.py
Describe the bug
train_text_to_image_lora.py runs fine with dataloader_num_workers=0, but it fails when dataloader_num_workers>0.
Reproduction
I set dataloader_num_workers=4; here is the output.
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 1
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
04/12/2024 10:38:20 - INFO - main - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'prediction_type', 'timestep_spacing', 'rescale_betas_zero_snr', 'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type', 'thresholding', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'force_upcast', 'scaling_factor', 'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'only_cross_attention', 'num_attention_heads', 'encoder_hid_dim', 'dropout', 'time_cond_proj_dim', 'time_embedding_dim', 'encoder_hid_dim_type', 'attention_type', 'dual_cross_attention', 'resnet_out_scale_factor', 'projection_class_embeddings_input_dim', 'num_class_embeds', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_type', 'conv_out_kernel', 'conv_in_kernel', 'transformer_layers_per_block', 'mid_block_only_cross_attention', 'use_linear_projection', 'mid_block_type', 'timestep_post_act', 'upcast_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'class_embed_type', 'resnet_skip_time_act', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'time_embedding_act_fn', 'resnet_time_scale_shift'} was not found in config. Values will be initialized to default values.
Resolving data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<?, ?it/s]
04/12/2024 10:38:24 - WARNING - datasets.builder - Found cached dataset imagefolder (C:/Users/HP/.cache/huggingface/datasets/imagefolder/default-f890b3e0a49a7f2c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 503.46it/s]
04/12/2024 10:38:25 - INFO - main - ***** Running training *****
04/12/2024 10:38:25 - INFO - main - Num examples = 20
04/12/2024 10:38:25 - INFO - main - Num Epochs = 100
04/12/2024 10:38:25 - INFO - main - Instantaneous batch size per device = 1
04/12/2024 10:38:25 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
04/12/2024 10:38:25 - INFO - main - Gradient Accumulation steps = 4
04/12/2024 10:38:25 - INFO - main - Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\work\projects\diffusers\examples\text_to_image\train_text_to_image_lora.py", line 1014, in
(py312) D:\work\projects\diffusers\examples\text_to_image>Traceback (most recent call last):
  File "
Logs
No response
System Info
- diffusers version: 0.28.0.dev0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.12.2
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Huggingface_hub version: 0.21.4
- Transformers version: 4.39.1
- Accelerate version: 0.28.0
- xFormers version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
No response
Could you make sure that you have run the accelerate config command and configured it properly before starting training?
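For illustration, either of the following should avoid the defaults warning shown in the log above; the values here simply mirror the defaults the launcher reported (single process, single machine, no dynamo backend), and the trailing ellipsis stands for the rest of your training arguments:

accelerate config

# or pass the values explicitly when launching:
accelerate launch --num_processes=1 --num_machines=1 --dynamo_backend="no" --mixed_precision="fp16" train_text_to_image_lora.py ...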
@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you try running with dataloader_num_workers>0 with the default dataset lambdalabs/pokemon-blip-captions, does the error still persist?
Could you make sure that you have run the accelerate config command and configured it properly before starting training?
The command I ran is as follows.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
@Hellcat1005 It is difficult to debug this without a reproducible example. What dataset are you trying to use here? Is it a custom one? If you try running with dataloader_num_workers>0 with the default dataset lambdalabs/pokemon-blip-captions, does the error still persist?
I use a custom dataset. I cannot use lambdalabs/pokemon-blip-captions right now; I am waiting for the author to approve my access request, but it seems to be a bit slow. The command I ran is as follows. It works with dataloader_num_workers=0.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py --pretrained_model_name_or_path="D:/work/projects/huggingface_weights/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9" --train_data_dir="D:/work/data/mouse/10" --num_train_epochs=100 --output_dir="./experiments/data10/exp1/weights" --mixed_precision="fp16" --dataloader_num_workers=2
@DN6 @Hellcat1005 I also ran into this issue when increasing dataloader_num_workers for pretty much any dataset with this script. My problem was solved by moving to Ubuntu/WSL, so I think this is a Windows-specific issue. The reason it happens is that preprocess_train is a local function inside main and cannot be pickled when multiple dataloader workers are used (this may be specific to Windows). A similar issue is this.
If you want to make it work on Windows, the main solution is to make preprocess_train/collate_fn and the like module-level (global) functions, like in here.
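To illustrate the pattern with a minimal, self-contained sketch (toy names, not the actual training script; here the picklable callable is a collate_fn rather than the script's preprocess_train transform): a function defined inside main() is a local object that the spawn-based worker processes on Windows cannot pickle, while a module-level function can be looked up by name and works.

import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([float(idx)])


# Picklable: defined at module level, so spawn-based workers can import it by name.
def collate_fn(batch):
    return torch.stack(batch)


def main():
    dataset = ToyDataset()

    # Defining collate_fn (or a dataset transform) here, inside main(), fails on
    # Windows with num_workers > 0 because local functions cannot be pickled.
    loader = DataLoader(dataset, batch_size=2, num_workers=2, collate_fn=collate_fn)
    for batch in loader:
        print(batch.shape)


if __name__ == "__main__":
    main()

The same idea would apply to the training script: hoisting preprocess_train, collate_fn, and similar helpers to module level (and passing anything they capture, such as the tokenizer, as arguments, e.g. via functools.partial) makes them picklable on Windows.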
Thanks for investigating @isamu-isozaki! Hmm, so the core issue seems to be with PyTorch multiprocessing on Windows then? Perhaps @Hellcat1005 you can modify the script to move the functions @isamu-isozaki mentioned outside of main, or run the script on an Ubuntu/WSL machine?
We can look into restructuring the training scripts to avoid this issue on Windows, but since all the scripts follow a similar structure, this would be an involved task for us at the moment.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.